Abstract

Consider two random strings having the same length and generated by an iid 
sequence taking its values uniformly in a finite alphabet. Artificially place a long 
block into one of the strings, where a block is a contiguous substring consisting 
only of one type of symbol. The long block replaces a segment of equal size and its 
length is smaller than the length of the strings, but larger than its square-root. We 
show that for sufficiently long strings the optimal alignment corresponding to the 
Longest Common Subsequence (LCS) treats the added long block very differently 
depending on the size of the alphabet. For two-letter alphabets, the long block gets 
mainly aligned with the same symbol from the other string, while for three or more 
letters the opposite is true and the long block gets mainly aligned with gaps. 

We further provide simulation results on the proportion of gaps in blocks of
various lengths. In our simulations, the blocks are "regular blocks" in an iid
sequence, and are not artificially added. Nonetheless, we observe for the natural
blocks a phenomenon similar to the one shown for the artificially added blocks:
with two letters, the longer blocks get aligned with a smaller proportion of gaps.
For three or more letters, the opposite is true.

It thus appears that the microscopic natures of two-letter optimal alignments and
three-letter optimal alignments are entirely different from each other.

1 Introduction 

Let x and y be two finite strings. A common subsequence of x and y is a string
which is a subsequence of x and at the same time a subsequence of y, while a Longest
Common Subsequence (LCS) of x and y is a common subsequence of maximal length.
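To make the definition concrete, the length of an LCS can be computed by the classical dynamic program over prefixes. The following is a minimal sketch; the function name `lcs_length` is ours and does not come from the paper.

```python
def lcs_length(x, y):
    """Length of a longest common subsequence of the sequences x and y."""
    # prev[j] holds |LCS(x[:i-1], y[:j])| while row i is being computed.
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            # Either extend a common subsequence with the matching pair (a, b),
            # or drop the last symbol of one of the two prefixes.
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    print(lcs_length("0100000001", "0010111010"))
```

The two strings in the usage line are the ones used as an illustration later in this introduction.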

LCS are often used as measures of string relatedness, and below we only consider
alignments which align same-letter pairs. Every such alignment defines a common subsequence,
and the length of the subsequence corresponding to an alignment is called the score of
the alignment. An alignment representing an LCS is also called an optimal alignment.

Longest Common Subsequences (LCS) and Optimal Alignments (OA) are important 
tools used in Computational Biology and Computational Linguistics for string matching 
[1]; [H], [9], and in particular, for the automatic recognition of related DNA pieces. 

The asymptotic behavior of the expectation and of the variance of the length of the LCS 
of two independent random strings has been studied by probabilists, physicists, computer 
scientists and computational biologists. The LCS problem can be formulated as a last 
passage percolation problem with dependent weights, and as such the problem of finding 
the fluctuation order for first and last passage percolation has been open for a long time. 

Chvátal and Sankoff [5] showed by a subadditivity argument that the limit

$$\gamma_k^* := \lim_{n\to\infty} \frac{\mathbb{E}\,|LCS(X_1 \cdots X_n;\, Y_1 \cdots Y_n)|}{n}$$

exists, where X and Y are two stationary strings independent of each other taking values in an
alphabet of size k. Even for the simplest distributions, such as iid strings with binary
equiprobable letters, the exact value of $\gamma_k^*$ is unknown but many simulations have led to
very good approximate values. For example, in the iid case, lower and upper bounds are
available, e.g.,

    k               2       3       4
    $\gamma_k^*$    0.812   0.717   0.654        (1.1)

where the precision in the above table is about ±0.01.

Alexander [2] further established the speed of convergence of $\mathbb{E}LC_n/n$ to $\gamma_k^*$ for iid
sequences, showing that

$$\gamma_k^* n - C\sqrt{n\log n} \le \mathbb{E}LC_n \le \gamma_k^* n, \tag{1.2}$$

where C > 0 is a constant depending neither on n nor on the distribution of $X_1$.

Therefore, the expected length of the LCS of two iid sequences, both of length n, is
about $\gamma_k^* n$, up to an error term of order not more than a constant times $\sqrt{n\log n}$. We can
also consider two sequences of different lengths, with the two lengths in a fixed
proportion to each other. To do so, let

$$\gamma_k(n,q) := \frac{\mathbb{E}\,|LCS(X_1 X_2 \cdots X_{n-nq};\, Y_1 Y_2 \cdots Y_{n+nq})|}{n}, \tag{1.3}$$

where $q \in [-1,1]$, and let

$$\gamma_k(q) := \lim_{n\to\infty} \gamma_k(n,q), \tag{1.4}$$

which again exists by subadditivity arguments. The function $\gamma_k : q \mapsto \gamma_k(q)$ is called the
mean LCS-curve; it is symmetric around q = 0 and concave, and thus has a maximum at
q = 0. It corresponds to the wet-region-shape in first passage percolation.

Throughout the rest of this paper, let $LC_n$ denote the length of the LCS of X and Y:

$$LC_n := |LCS(X;Y)|.$$

The main results (Theorem 2.1 and Theorem 2.2) of the paper are concerned with
sequences of length n = 2d. They describe the effect of replacing an iid piece with exactly
one long block of equal length. The artificially inserted long block has length $d^\beta$ with
$1/2 < \beta < 1$. It is shown that typically replacing an iid part by a long block of the same
length leads to a decrease in the LCS. It is also shown that the long block gets aligned
mainly with symbols in the binary case, while with three or more letters it gets aligned
mainly with gaps.
To illustrate our results, consider the sequences 

0100000001 

and 

0010111010, 

where the bold faced letters are those of the added block. Theorems 2.1 and 2.2 assert
that the optimal alignment behaves very differently depending on whether the sequences
are binary or have three or more equiprobable letters: in the binary case the
long block gets mainly aligned with bits, while with more letters it gets mainly aligned
with gaps. This holds with high probability, assuming d to be sufficiently large. In
the current example, an LCS of the two strings 0100000001 and 0010111010 is given by
00000 (the LCS is not unique here), which corresponds to the optimal alignment
(The optimal alignment is also not unique.) But note that in the above example of an
optimal alignment all the zeros from the long block (in bold face) got aligned with zeros
and not with gaps. Our result shows that for binary sequences, the artificially added
long block gets aligned with very few gaps as soon as d is taken large. More precisely,
the number of gaps has an order of magnitude smaller than the length of the long block.
With three or more letters we prove the opposite: the long block gets aligned almost
exclusively with gaps. The situation for three or more letters is not surprising, but
the situation for binary sequences is rather counterintuitive. Our proof is for d going
to infinity. Nonetheless, the phenomenon is observed in simulations for regular blocks
which have not been artificially added: with binary sequences longer blocks tend to be
aligned with a small proportion of gaps, while with more letters the opposite is true.
(More examples are presented at the start of Subsection 2.1.) We thus seem to have
uncovered an interesting phenomenon, in that the microstructure of the optimal alignment
of iid sequences for binary sequences is fundamentally different from the case with more
letters. It is another instance (see [B]) where the size of the alphabet in a subsequence
problem plays an important role.

As for the content of the paper, Section 2 states the results and gives some of the
main ideas behind them, while the proofs are presented in Section 3. In addition to its
own interest, the present paper serves as background to showing that the variance of the
LCS of two iid random strings with many added long blocks is linear in the length of the
strings (see [3]).
2 Main ideas and results

In this section we formally state our main results and explain some of the ideas behind
their proofs. Below, both sequences X and Y have length 2d and in the middle of the
sequence X, there is a long block of length about $\ell = d^\beta$. The results are consequences
of concentration inequalities combined with the following facts:

1. A first ingredient is that $\gamma_k(n,q)$ converges, uniformly in q, to $\gamma_k(q)$ at a speed of
$\sqrt{\ln n}/\sqrt{n}$. More precisely, Alexander in [2], Example 1.4 and Theorem 4.2, shows
that there exists a constant C > 0 independent of n and of $q \in [-1,1]$ such that

$$|\gamma_k(n,q) - \gamma_k(q)| \le C\,\frac{\sqrt{\ln n}}{\sqrt{n}} \tag{2.1}$$

for all n and all $q \in [-1,1]$. (This order also follows from Hoeffding's inequality.)

2. When strings with only one symbol are aligned with another iid string with equiprob- 
able letters, the LCS is typically much shorter than for two iid strings with equiprob- 
able letters. 

Let us present this last fact. Let v = 000000 and w = 100101, so that LCS(v, w) = 000 and
|LCS(v, w)| = 3. Hence, the length of the LCS is the number of zeros in the string w. If
w is an iid string with k equiprobable letters but v consists only of zeros and both strings
have the same length, then typically the LCS in this particular situation has length about
|w|/k. This is typically much less than for two iid sequences with equiprobable letters,
where the typical LCS length is about $\gamma_k^* |w|$. One can compare 1/k with $\gamma_k^*$ and see that
1/k is significantly smaller than $\gamma_k^*$ for all $k \ge 2$. Though $\gamma_k^*$ is not known exactly, the
available bounds clearly show such an inequality.
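The fact just described can be checked on small cases. The sketch below is only an illustration under our own naming (`lcs_length`, `mean_lcs_fraction`); it verifies that aligning a one-symbol string with w yields the number of occurrences of that symbol in w, and estimates the LCS fraction in the two situations by simulation.

```python
import random


def lcs_length(x, y):
    """Length of a longest common subsequence of x and y (row-by-row DP)."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def mean_lcs_fraction(k, n, runs, constant_first, seed=0):
    """Average of |LCS(v, w)| / n over `runs` samples.

    w is iid uniform on k letters; v is either a single long block
    (constant_first=True) or another iid uniform string.
    """
    rng = random.Random(seed)
    letters = [str(i) for i in range(k)]  # hypothetical alphabet "0", "1", ...
    total = 0
    for _ in range(runs):
        w = [rng.choice(letters) for _ in range(n)]
        if constant_first:
            v = [letters[0]] * n
        else:
            v = [rng.choice(letters) for _ in range(n)]
        total += lcs_length(v, w)
    return total / (runs * n)


if __name__ == "__main__":
    for k in (2, 3, 4):
        print(k, mean_lcs_fraction(k, 200, 20, True), mean_lcs_fraction(k, 200, 20, False))
```

For moderate n the first printed fraction should be near 1/k, while the second is a finite-size estimate that lies somewhat below $\gamma_k^*$.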

The general idea on how to prove the bias effect of the long block on the LCS can
be summarized as follows: when for two sequences of length 2d we replace an iid part
by one long block, this causes an expected loss. The variance cannot make up for it,
because by Hoeffding's inequality, the standard deviation is at most of order $\sqrt{d}$. But we
chose the length of the long block $\ell = d^\beta$ (with $\beta > 1/2$) to be of an order of magnitude
above $\sqrt{d}$. That the expected effect of replacing an iid piece with a long block of equal
length is linear in $\ell$ is shown in Section 3. In the present section, we present below the
main idea underlying this fact. For this we first consider two sequences of length d (and
not 2d). We analyze the effect of adding $\ell = d^\beta$ iid symbols at the end of only one of
the two sequences. This is presented in paragraph I) below. Note that the argument
only works if the mean LCS-curve has a derivative at its maximum. Moreover,
the explanation given in I) below also holds when the sequences have length close to d
instead of being exactly equal to d. In the case of sequences of length 2d with one long
block replacing an iid part, typically the long block gets aligned "approximately in the
middle of the other sequence." Hence, one can use for the 2d-length case the argument
presented in I). This is explained in II) below.
I) Effect of adding $\ell = d^\beta$ symbols to one sequence only. Assume that V and
W are two independent iid strings of length d with k equiprobable letters. Assume that
we increase the length of W by adding $d^\beta$ iid symbols. The symbols are drawn from the
same alphabet as the strings V and W, and again the symbols are equiprobable. We
are interested in finding out the typical size of the increase in the LCS score due to the
addition of these $d^\beta$ letters. Let the increase be denoted by $\Delta$, so that

$$\Delta := |LCS(V;\, W_1 W_2 \cdots W_{d+d^\beta-1} W_{d+d^\beta})| - |LCS(V;\, W)|.$$

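We assume d large. We are first going to look at the expected gain $\mathbb{E}\Delta$. For large d,
$\mathbb{E}|LCS(V;W)|$ is approximately equal to $\gamma_k^* d$. Assuming that the derivative of the map
$\gamma_k(\cdot)$ is defined at its maximum points, we find that the expected gain $\mathbb{E}\Delta$ is about equal
to $\gamma_k^* d^\beta/2$. (Without the assumption on the derivative, it is not clear how to obtain the
size of the increase when we add symbols only on one of the strings.) Let us explain how
the order $\gamma_k^* d^\beta/2$ can be obtained. By the very definition of $\gamma_k(\cdot,\cdot)$,

$$\mathbb{E}\Delta = \left(d + \frac{d^\beta}{2}\right)\gamma_k\!\left(d + \frac{d^\beta}{2},\, \frac{d^\beta}{2d+d^\beta}\right) - d\,\gamma_k(d, 0).$$

Using the inequality (2.1), the last equality above becomes

$$\mathbb{E}\Delta = \left(d + \frac{d^\beta}{2}\right)\gamma_k\!\left(\frac{d^\beta}{2d+d^\beta}\right) - d\,\gamma_k(0) + O\!\left(\sqrt{d\ln d}\right),$$

which can also be written as

$$\mathbb{E}\Delta = \gamma_k(0)\,\frac{d^\beta}{2} + \left(d + \frac{d^\beta}{2}\right)\left(\gamma_k\!\left(\frac{d^\beta}{2d+d^\beta}\right) - \gamma_k(0)\right) + O\!\left(\sqrt{d\ln d}\right). \tag{2.2}$$

Now, $d^\beta/(2d+d^\beta) \to 0$ as $d \to \infty$. We also assumed that $\gamma_k(\cdot)$ has a derivative at 0, and
by symmetry that derivative can only be equal to zero. Therefore,

$$\lim_{d\to\infty} \frac{\gamma_k\!\left(\frac{d^\beta}{2d+d^\beta}\right) - \gamma_k(0)}{\frac{d^\beta}{2d+d^\beta}} = 0,$$

and, since $(d + d^\beta/2)\cdot d^\beta/(2d+d^\beta) = d^\beta/2$,

$$\frac{\gamma_k\!\left(\frac{d^\beta}{2d+d^\beta}\right) - \gamma_k(0)}{\frac{d^\beta}{2d+d^\beta}} \cdot \frac{d^\beta}{2} = o(d^\beta). \tag{2.3}$$

Since $\beta > 1/2$, $d^\beta$ is of much larger order than $\sqrt{d\ln d}$. Using this together with equation
(2.3) in (2.2) yields the desired order of magnitude:

$$\mathbb{E}\Delta = \frac{\gamma_k^* d^\beta}{2} + o(d^\beta), \tag{2.4}$$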
where by definition $\gamma_k(0) = \gamma_k^*$. So we have obtained the order of magnitude of the expectation of
$\Delta$. With high probability, the order of magnitude of $\Delta$ is the same as that of its expectation, the
reason being that the standard deviation of $\Delta$ is of order

$$\sqrt{\mathrm{Var}\,\Delta} = O(\sqrt{d}),$$

which is much smaller than the order of $\mathbb{E}\Delta$ since $1/2 < \beta$. Indeed, in our context, the
variable $\Delta$ is a function of the iid entries $V_1 V_2 \cdots V_d$ and $W_1 W_2 \cdots W_{d+d^\beta}$. If we change
only one of the entries, $\Delta$ changes by at most 2 units. Hence, by Hoeffding's inequality
and setting $u = 2d + d^\beta$,

$$P(|\Delta - \mathbb{E}\Delta| \ge t) \le 2\exp\left(-\frac{t^2}{2u}\right) \tag{2.5}$$

for all t > 0. Therefore, integrating out (2.5) gives

$$\mathrm{Var}\,\Delta \le 4u = 8d + 4d^\beta.$$
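The quantity $\Delta$ of paragraph I) can also be explored numerically. The following is a rough Monte Carlo sketch, not the authors' code; the names `sample_delta` and `mean_delta` and the rounding of $d^\beta$ to an integer are our assumptions. For the moderate values of d reachable in pure Python, the average is only a finite-size approximation of $\gamma_k^* d^\beta/2$.

```python
import random


def lcs_length(x, y):
    """Length of a longest common subsequence of x and y (row-by-row DP)."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def sample_delta(k, d, extra, rng):
    """One sample of Delta = |LCS(V; W extended by `extra` symbols)| - |LCS(V; W)|."""
    v = [rng.randrange(k) for _ in range(d)]
    w_long = [rng.randrange(k) for _ in range(d + extra)]
    # W is the length-d prefix of the extended string.
    return lcs_length(v, w_long) - lcs_length(v, w_long[:d])


def mean_delta(k, d, beta, runs, seed=0):
    """Return (number of added symbols, Monte Carlo average of Delta)."""
    rng = random.Random(seed)
    extra = round(d ** beta)  # assumption: d^beta rounded to the nearest integer
    total = sum(sample_delta(k, d, extra, rng) for _ in range(runs))
    return extra, total / runs


if __name__ == "__main__":
    extra, avg = mean_delta(k=2, d=400, beta=0.75, runs=20)
    print(extra, avg, avg / extra)
```

The last printed ratio can be compared with $\gamma_k^*/2$; each sample always satisfies $0 \le \Delta \le d^\beta$, since appending symbols never decreases the LCS and each appended symbol adds at most one matched pair.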




II) Effect of replacing the long block by iid. The previous argument, which deals with
adding an iid string of length $d^\beta$ to one sequence of length d, can be adapted to
tackle the case of two sequences of length 2d. This is explained in Subsection 3.1 and
then in detail in the rest of Section 3. Here, let us already give a somewhat over-simplified
summary:

Assume now that $X = X^a B X^c$, where X is the string of length 2d we consider and B
is the long block of length $d^\beta$. Assume that a is an optimal alignment of X with Y,
i.e., an alignment corresponding to an LCS. Let $Y^a$, $Y^b$, resp. $Y^c$
be the piece of Y aligned with $X^a$, B, resp. $X^c$ by a. We will show (see Section 3) that
typically $Y^a$ and $Y^c$ must have about the same length as $X^a$ and $X^c$. In other words, the
four strings $X^a$, $X^c$, $Y^a$ and $Y^c$ typically have length about equal to d. So the result in
I) applies to $X^a$ and $Y^a$, despite the lengths being approximately d instead of exactly d.
Assume now that $Y^b$ has length h, which should be of linear order in $\ell = d^\beta$. We modify
a to obtain a new alignment $\bar a$. For this we keep the way $X^c$ and $Y^c$ are aligned with each
other, but instead of aligning the long block B with $Y^b$, we "add" $Y^b$ to the alignment of
$X^a$ with $Y^a$. By the phenomenon explained in I) above, we get in this way an increase
of about $\gamma_k^*/2$ times the length of $Y^b$, thus typically an increase of $h\gamma_k^*/2$. Before, the
part $Y^b$ was aligned with the long block, so only with one type of symbol, and so before
there were about h/k symbols of $Y^b$ not aligned with gaps. Hence, the change in the number
of aligned letters is about

$$h\left(\frac{\gamma_k^*}{2} - \frac{1}{k}\right).$$

Therefore whenever $\gamma_k^*/2 - 1/k > 0$ (which is the case for three or more letters),
the change typically increases the number of aligned letters. This leads to a contradiction
and implies that a is not an optimal alignment. Hence, for three or more letters the long
block can typically not be aligned with a piece $Y^b$ of linear order in $\ell$. In other words, for
three letters or more the long block is typically mainly aligned with gaps.
III) For which k do we have $\gamma_k^* k > 2$? Whether an artificially added long block gets
mainly aligned with gaps or not depends on whether $\gamma_k^*/2$ is larger or smaller than 1/k.
It turns out that $\gamma_k^*$ is smaller than 2/k only when we consider binary strings, that is,
when k = 2. For every $k \ge 3$, the opposite inequality is satisfied. Despite the exact values
of $\gamma_k^*$ not being known, there are rigorous bounds available, precise enough to show our
assertions. In any case, for large k, Kiwi, Loebl and Matoušek [7] have shown that $\gamma_k^*$
is of linear order in $1/\sqrt{k}$, making $\gamma_k^*/2$ strictly larger than 1/k when k is large enough.
With k = 3, we are pretty close to a critical point, as can also be seen in our simulations.
Taking the value of 0.717 in Table (1.1), $\gamma_3^*/2 = 0.3585$, which is above 1/3. This is
pretty close, especially since the precision to which the values
$\gamma_k^*$ are known is about 0.01. In computational biology, the most important case is k = 4,
and for $\gamma_k^* > 2/k$ to hold, we need $\gamma_4^* > 1/2$. In our table we gave the approximate value
$\gamma_4^* \approx 0.654$.
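The comparison in III) amounts to a line of arithmetic per alphabet size. As an illustration, the sketch below plugs in the approximate values 0.812, 0.717 and 0.654 quoted in the Introduction; the dictionary and function names are ours.

```python
# Approximate values of gamma_k^* quoted in the Introduction (precision about 0.01).
APPROX_GAMMA = {2: 0.812, 3: 0.717, 4: 0.654}


def margin(k, gamma_k):
    """gamma_k / 2 - 1 / k: positive means the heuristic predicts 'mostly gaps'."""
    return gamma_k / 2 - 1 / k


def block_mostly_gapped(k, gamma_k):
    """True when gamma_k * k > 2, i.e. when the margin above is positive."""
    return margin(k, gamma_k) > 0


if __name__ == "__main__":
    for k, g in APPROX_GAMMA.items():
        print(k, round(margin(k, g), 4), block_mostly_gapped(k, g))
```

With these values the margin for k = 3 is about 0.025, of the same order as the quoted precision of 0.01, which is the sense in which three letters sit close to the critical point.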

2.1 Results for an added long block 

In this subsection, we state precisely the following result: the existence of a derivative of
the function $\gamma_k$ at each of its maxima implies that if an above-average long block is inserted in
the middle of an iid sequence, then the proportion of symbols from the long block aligned
with gaps is typically close to zero for binary sequences, while for three or more letters
the opposite holds and typically the proportion is close to one. In the present subsection,
both sequences X and Y have length 2d, while the long block has length about $\ell = d^\beta$
with $1/2 < \beta < 1$, and the results hold for d large enough. But let us first explain
precisely what we mean by "the proportion of symbols in a block aligned with gaps".

Let us first explain how we count the gaps a block gets aligned with. For this let x
be the string x := 00011100 and let y := 00011001. The first block of x consists of three
zeros, its second block consists of three ones and the third block consists of two zeros, and
the LCS of x and y is

LCS(x; y) = 0001100,
which corresponds to the alignment 



    x      00011100
    y      00011 001
    LCS    00011 00



In the above alignment, the first block of x is aligned with no gaps, while the second block
of x is aligned with one gap. So 1/3 of its symbols get aligned with gaps. Finally the last
block of x is aligned with no gaps, so there the proportion of symbols aligned with gaps
is zero.
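The counting convention just described can be written down in a few lines. In the sketch below, an alignment is represented by two rows of equal length in which a gap is written as "-"; this representation and the function name `gaps_per_block` are our own choices for illustration.

```python
from itertools import groupby


def gaps_per_block(aligned_x, aligned_y, gap="-"):
    """For each block (maximal run of one symbol) of x, count its symbols aligned with gaps.

    Returns a list of (symbol, block_length, number_of_gaps) in the order of the blocks.
    """
    if len(aligned_x) != len(aligned_y):
        raise ValueError("both alignment rows must have the same number of columns")
    # Keep only the columns that carry a symbol of x, paired with what y shows there.
    columns = [(a, b) for a, b in zip(aligned_x, aligned_y) if a != gap]
    result = []
    for symbol, run in groupby(columns, key=lambda col: col[0]):
        run = list(run)
        gaps = sum(1 for _, b in run if b == gap)
        result.append((symbol, len(run), gaps))
    return result


if __name__ == "__main__":
    # The alignment of x = 00011100 and y = 00011001 discussed above.
    for symbol, length, gaps in gaps_per_block("00011100-", "00011-001"):
        print(symbol, length, gaps, gaps / length)
```

On this alignment the three blocks of x get proportions 0, 1/3 and 0, matching the count carried out in the text.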

Let us next present an example to illustrate how with two letters long blocks tend to 
be aligned with a proportion of gaps close to zero, while with three and more letters the 
opposite is true: 
For this consider first the two binary strings y = 01111001011011011101001 and
x = 10010111100000101101101. The alignment corresponding to the LCS is

    x    10010111100 010110110 1
    y    1 111 00101101101 1101 001

We observe that the long block of five zeros in x is aligned with no gaps, indicating that
every 0 from the long block is aligned with a 0.

Let us next consider an example with six letters. For this let y = 55425153112422255656
and let x = 65324214444412356631. The alignment corresponding to the LCS 542112566
is

    x    65 3242 1 44444123 56 631
    y    55 4251531 12 422255656

The long block of five fours in x is aligned with gaps. The strings x and y in the previous
example are "generated" solely in the following way: throw a fair six-sided die independently
to obtain the strings everywhere except in the location of the long block. For the
long block, i.e., for the piece $x_8 x_9 x_{10} x_{11} x_{12}$, decide in advance to introduce artificially a
long block: $x_8 = x_9 = \cdots = x_{12}$. Outside that piece, throw the six-sided die independently;
hence, for $x_1 x_2 \cdots x_7$ the die is thrown independently 7 times, and similarly
for $x_{13} x_{14} x_{15} x_{16} x_{17} x_{18} x_{19} x_{20}$ and for the whole string y.

Let us at this stage describe formally the model with one inserted long block and
sequences of length 2d: we consider two independent random strings $X = X_1 X_2 \cdots X_{2d}$
and $Y = Y_1 Y_2 \cdots Y_{2d}$ of length 2d. A long block of length about $\ell$ (integer length $\ell + 1$)
is inserted artificially into the middle of the string X. (We assume that $\ell$ is even.) This
means that a long block replaces an iid part of equal length. Thus:

$$P\left(X_{d-(\ell/2)} = X_{d-(\ell/2)+1} = \cdots = X_{d+(\ell/2)-1} = X_{d+(\ell/2)}\right) = 1.$$

For the rest, the strings are iid with k equiprobable letters. (Hence, Y, $X_1 X_2 \cdots X_{d-(\ell/2)-1}$
and $X_{d+(\ell/2)+1} X_{d+(\ell/2)+2} \cdots X_{2d-1} X_{2d}$ are three independent iid strings. Moreover,
$P(X_i = j) = P(Y_i = j) = 1/k$ for all j = 1, 2, ..., k and all i = 1, 2, ..., 2d.)
Let $\beta$ and $\beta_1$ be two constants independent of d and such that

$$\frac{1}{2} < \beta_1 < \beta < 1, \tag{2.6}$$

and let the length of the long block be $\ell = d^\beta$. To formulate our main result
on one inserted long block, we need two definitions:

Let $A^n$ be the event that the long block is mainly aligned with gaps. More
precisely, we define $A^n$ to be the event that no more than $d^{\beta_1}$ symbols of the long block
do not get aligned with gaps in any LCS-alignment. This is the same as requesting that
the score does not decrease by more than $d^{\beta_1}$ when we cut out the long block:

$$|LCS(X_1 X_2 \cdots X_{d-(\ell/2)-1} X_{d-(\ell/2)} X_{d+(\ell/2)} X_{d+(\ell/2)+1} \cdots X_{2d};\, Y)| + d^{\beta_1} \ge |LCS(X;Y)|,$$

where $\ell = d^\beta$.
Let $O^n$ be the event that replacing the long block with iid symbols
gives an increase of about $\gamma_k^*/2$ times the length of the long block. More precisely,
let $\tilde\gamma_k$ be any constant strictly smaller than $\gamma_k^*$ and not depending on d. Formally, the
event $O^n$ is the event that when replacing the long block by iid symbols the length of the LCS
increases by at least $(\tilde\gamma_k/2)\, d^\beta - d^{\beta_1}$. (Note that since $\beta_1 < \beta$, the increase
is of order a constant times $d^\beta$.) Hence $O^n$ is the event that

$$|LCS(X^*;Y)| - |LCS(X;Y)| \ge \frac{\tilde\gamma_k}{2}\, d^\beta - d^{\beta_1},$$

where $X^*$ denotes the string obtained from X by replacing the long block by iid symbols. In other
words, for $i \in [1, d-(\ell/2)-1] \cup [d+(\ell/2)+1, 2d]$, we have $X_i^* := X_i$. The whole string
$X^* = X_1^* X_2^* \cdots X_{2d}^*$ is iid.

As already mentioned, having three or more letters is equivalent to the condition
$\gamma_k^* k > 2$. We are now ready to state formally our main result for three or more letters
and one long block inserted when both strings X and Y have length 2d. As previously
mentioned, it asserts that with high probability the long block gets mainly aligned with
gaps and replacing the long block by iid symbols yields an increase in the length of
the LCS:

Theorem 2.1 Let $\gamma_k^* k > 2$, and let also the mean LCS function $\gamma_k : [-1,1] \to \mathbb{R}$ be
differentiable at all its maxima. Let $1/2 < \beta_1 < \beta < 1$. Then, there exists a constant
C > 0, independent of d, such that

$$P(A^n) \ge 1 - e^{-C d^{2\beta_1 - 1}}$$

and

$$P(O^n) \ge 1 - e^{-C d^{2\beta_1 - 1}}$$

for all $d \ge 1$.

Since we chose $2\beta_1 > 1$, the theorem above asserts that the events $A^n$ and $O^n$ occur
with high probability as soon as d is of medium size.

Let us next give the result for the two letter case. For this we first need some more 
definitions: 

Let $G^n$ be the event that the long block gets mainly aligned with symbols and
not with gaps. More precisely, $G^n$ is the event that the long block has at most $d^{\beta_1}$
of its symbols aligned with gaps. (Recall that the length of the long block is $d^\beta$ where
$\beta > \beta_1$, and so for large d, $d^{\beta_1}$ becomes negligible in comparison to the length of the long
block $d^\beta$.) Saying that no more than $d^{\beta_1}$ of the symbols of the long block get aligned with
gaps in any optimal alignment is equivalent to the following: leaving out $d^{\beta_1}$ symbols
from the long block makes the LCS decrease by at least one unit. Hence, formally the
event $G^n$ can be described by

$$|LCS(X;Y)| > |LCS(X_1 X_2 \cdots X_{d-(\ell/2)} X_{d-(\ell/2)+d^{\beta_1}+1} X_{d-(\ell/2)+d^{\beta_1}+2} \cdots X_{2d};\, Y)|.$$

(Note that $X_1 X_2 \cdots X_{d-(\ell/2)} X_{d-(\ell/2)+d^{\beta_1}+1} X_{d-(\ell/2)+d^{\beta_1}+2} \cdots X_{2d}$ is simply the string
$X_1 X_2 \cdots X_{2d}$ from which the piece $X_{d-(\ell/2)+1} X_{d-(\ell/2)+2} \cdots X_{d-(\ell/2)+d^{\beta_1}}$ is cut out.)
Let $H^n$ be the event that replacing the long block with iid symbols increases
the LCS by at least $c_H d^\beta$. Here $c_H > 0$ is any constant not depending on d and such
that

< I - 1- (2.7) 

More precisely, $H^n$ is the event

$$|LCS(X^*;Y)| - |LCS(X;Y)| \ge c_H d^\beta.$$

Let us next formulate our main result for the 2-letter case with an additional long block
added. It states that with high probability most of the symbols from the long block are
not aligned with gaps. It also states that when we replace the long block with iid symbols,
typically the LCS-length increases proportionally to the length of the long block.

Theorem 2.2 Let $\gamma_k^* k < 2$, and let the mean LCS function $\gamma_2 : [-1,1] \to \mathbb{R}$ be
differentiable at all its maxima. Then, there exist constants $C_G, C_H > 0$, independent of d, such
that

and 

for all d > 1. 

The situation we encountered for two letters might seem counterintuitive at first. Let 
us explain why: Consider for this two binary sequences of length n where one string is 
made out only of ones and the other is made of zeros and ones both symbols having the 
same probability. Then the length of the LCS is the number of ones in the sequence 
with both symbols. If both symbols have probability 1/2, then the length of the LCS 
will be approximately 1/2 times the length of the strings. However for two binary iid 
sequences, the average length of the LCS is about 0.8 of the total length. Hence, the 
LCS is much greater for two iid sequences, than when one sequence is made up of only 
one letter (i.e., one sequence is just "one long block"). Thus, one would think that when 
within a sequence one gets an exceptionally long block, then this should typically decrease 
the total LCS. Hence, since a long block "scores" much less than a typical piece of string 
iid drawn, one would expect that the long block tends to be "left out" and not used too 
much (and hence aligned with many gaps). But the opposite is true. Also, in an optimal
alignment similar strings tend to be matched. Since a long block is very different from
an iid string, it would seem that a long block should be "left out" and mainly matched
with gaps. This typically happens with three or more letters, but with two letters, the
opposite is true.

2.2 Simulation and consequences for the nature of alignments 

In an iid sequence of length 2d it is very unlikely to find a block of length $d^\beta$. Typically
the blocks might reach a length of linear order in ln d. Nonetheless, what we prove for an
artificially added long block can be observed in simulations for blocks with lengths which
might occur naturally. In this subsection we present the results of our simulations.

The first table gives a simulated estimate of the expected number of gaps in a block
of length $\ell$ placed in the middle of a string of length 1000 (except for $\ell > 100$, where we
took a string of length 4000) as a function of the number k of letters. Since there might
be several optimal alignments, we chose the one putting a maximum number of gaps into
the long block. Also, note that placing a block of a length which might naturally occur is
like finding in an iid sequence a block of that length, rather than adding a long block.





          ℓ=1    ℓ=2    ℓ=5    ℓ=10   ℓ=20   ℓ=30   ℓ=50   ℓ=100  ℓ=200  ℓ=300  ℓ=400
    k=2   0.53   1.67   2.25   2.75   4.2    6.17   8.16   14.68  12.26  14.2   19.6
    k=3   -      -      2.85   4.6    12.5   18     32.3   70.64  152.6  226    -
    k=4   0.72   1.19   3.27   6.78   16.3   25.6   43.8   88.4   -      -      -
    k=5   -      1.6    3.36   7.76   16.3   27.1   49.7   96.2   -      -      -
    k=6   -      1.43   3.67   8.32   17.2   28.2   47.7   97.1   -      -      -
    k=7   -      1.53   3.82   8.6    18.7   27.9   48.6   98.1   -      -      -
    k=9   -      -      4.23   8.7    18.4   29.2   48.4   -      -      -      -











For each entry we ran 100 independent simulations. For each simulation, we found the
number of gaps the block of length $\ell$ gets aligned with. Then, we computed the average
of that number of gaps over the 100 simulations. This gave the entries of the above table.
The next table gives an estimate for the expected number of gaps divided by the length of
the block. This means that the next table is obtained from the previous one by dividing
each entry by the value $\ell$ corresponding to its column. The entries in the next table thus
represent the "proportion of gaps" in the long blocks depending on the length of the long
block:





          ℓ=1    ℓ=2    ℓ=5    ℓ=10   ℓ=20   ℓ=30   ℓ=50   ℓ=100  ℓ=200  ℓ=300  ℓ=400
    k=2   0.53   0.83   0.45   0.27   0.21   0.20   0.16   0.14   0.06   0.04   0.04
    k=3   -      -      0.19   0.15   0.62   0.6    0.64   0.7    0.76   0.75   -
    k=4   0.72   0.59   0.65   0.67   0.81   0.85   0.87   0.88   -      -      -
    k=5   -      0.8    0.67   0.77   0.81   0.90   0.99   0.96   -      -      -
    k=6   -      0.7    0.67   0.83   0.86   0.94   0.95   0.97   -      -      -
    k=7   -      0.75   0.76   0.86   0.93   0.93   0.97   0.98   -      -      -
    k=9   -      -      0.8    0.87   0.92   0.97   0.96   -      -      -      -







We see that with 2 letters, as the length of the block increases, the proportion of gaps
decreases. On the other hand, for $k \ge 3$ the opposite is true. (k = 3 seems to be a
value close to the critical point, so this phenomenon kicks in only slowly.) Note also that
this does not just appear for very long blocks but already for blocks of length around 5.
Therefore, the micro-structure of the optimal alignment seems rather different, depending
on the number of letters.
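The simulation procedure behind the two tables can be sketched as follows. This is our own illustrative reconstruction, not the authors' code: the function names are hypothetical, and the way the maximum number of gaps over all optimal alignments is computed (by splitting Y into the parts aligned with the left piece, the block, and the right piece of X) is one possible approach among others. It uses the fact that a block of one symbol c can match at most min(ℓ, number of c's) symbols of any piece of Y.

```python
import random


def lcs_row(a, b):
    """row[j] = |LCS(a, b[:j])| for j = 0..len(b), using a single DP row."""
    row = [0] * (len(b) + 1)
    for ch in a:
        diag = 0  # value of the previous DP row at column j - 1
        for j in range(1, len(b) + 1):
            above = row[j]
            row[j] = diag + 1 if ch == b[j - 1] else max(above, row[j - 1])
            diag = above
    return row


def max_gaps_in_block(x_left, block, x_right, y):
    """Return (|LCS(X; Y)|, max number of block symbols aligned with gaps in an LCS alignment).

    X = x_left + block + x_right, where block is a non-empty run of one symbol.
    """
    n, ell, c = len(y), len(block), block[0]
    prefix = lcs_row(x_left, y)                     # prefix[j] = |LCS(x_left, y[:j])|
    rev = lcs_row(x_right[::-1], y[::-1])
    suffix = [rev[n - j] for j in range(n + 1)]     # suffix[j] = |LCS(x_right, y[j:])|
    count = [0]
    for ch in y:
        count.append(count[-1] + (ch == c))         # count[j] = occurrences of c in y[:j]

    def cap(j1, j2):
        # Most block symbols that can be matched inside y[j1:j2].
        return min(ell, count[j2] - count[j1])

    pairs = [(j1, j2) for j1 in range(n + 1) for j2 in range(j1, n + 1)]
    best = max(prefix[j1] + cap(j1, j2) + suffix[j2] for j1, j2 in pairs)
    # Fewest matched block symbols among alignments that still reach the optimal score.
    min_matched = min(best - prefix[j1] - suffix[j2] for j1, j2 in pairs
                      if best - prefix[j1] - suffix[j2] <= cap(j1, j2))
    return best, ell - min_matched


def mean_max_gaps(k, ell, n, runs, seed=0):
    """Average, over `runs` samples, of the maximal gap count of a block placed mid-string."""
    rng = random.Random(seed)
    letters = "0123456789"[:k]  # this sketch supports k <= 10 only
    left_len = (n - ell) // 2
    total = 0
    for _ in range(runs):
        def iid(m):
            return "".join(rng.choice(letters) for _ in range(m))
        block = rng.choice(letters) * ell
        total += max_gaps_in_block(iid(left_len), block, iid(n - ell - left_len), iid(n))[1]
    return total / runs


if __name__ == "__main__":
    for k in (2, 3, 4):
        print(k, mean_max_gaps(k, ell=10, n=200, runs=10))
```

The parameters in the usage lines are much smaller than those of the tables (strings of length 1000 or 4000 and 100 runs), which would be slow in pure Python; the printed values are therefore only indicative and are not expected to reproduce the table entries.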

What heuristic argument could explain why the result for artificially added
long blocks implies the same for iid sequences? The simulations show that
blocks with naturally occurring lengths exhibit the phenomenon proved for long blocks
added artificially. Now, if we have a block of length $\ell_B$ and $\ell_B$ is much smaller than the
required $d^\beta$, then simply take the neighborhood of size $\ell_B^{1/\beta}$ of that block. In the optimal
alignment of X and Y, that neighborhood should also typically be aligned optimally. So
for that part of the alignment our result should apply.

Let us next provide simulation results where the individual values of the number
of gaps obtained at each run of our simulation are given. This should give the reader a sense
of the order of the fluctuation of the number of gaps in long blocks when we hold the
length of the long block fixed. Here column i gives the result obtained in the i-th simulation.
We considered only blocks of length $\ell = 100$ in the next table.





          i=1   i=2   i=3   i=4   i=5   i=6   i=7   i=8   i=9   i=10
    k=2   2     1     2     31    9           1     3     5     7
    k=3   100   97    30    66    76    79    73    93    74    91
    k=4   98    98    99    100   99    93    99    99    100   60
    k=8   99    100   100   95    100   99    98    98    100   99



Let us analyze, for example, the number of gaps in the case of k = 4 letters. In two
out of ten simulations we got 100 gaps and in four out of ten we got 99 gaps, while once
we obtained a much lower value of 60 gaps. This seems to indicate that the number of
gaps has a strongly skewed distribution here. If we estimate the median from the ten
simulations above for k = 4, we find as estimate 98.5. Compare this to the estimated
expected number of gaps with 4 letters and block length $\ell = 100$, which was 88.4. For the
two-letter case the estimated median of the number of gaps in a block of length $\ell = 100$
based on our ten simulations above is 2.5. The estimated expected number of gaps for
two letters and block length $\ell = 100$, on the other hand, was 14.68. It thus appears that to
take into account this skewness, a median estimate might be more appropriate than an
expectation estimate. The discrepancy between the two-letter situation and the situation
with more letters becomes even more pronounced when looking at the median.

3 The proofs 

In this section we are in the setting of Section 2. Hence $X$ and $Y$ are two independent random
sequences of length $2d$; the string $Y$ is iid, while the string $X$ has one long block in its
middle and is iid everywhere else. Thus $X_1X_2\ldots X_{d-(\ell/2)-1}$ and $X_{d+(\ell/2)+1}X_{d+(\ell/2)+2}\ldots X_{2d}$
are independent iid strings, and we have a long block of length approximately $\ell$ in the
middle of the string $X$:

$$P\left(X_{d-(\ell/2)} = X_{d-(\ell/2)+1} = X_{d-(\ell/2)+2} = \cdots = X_{d+(\ell/2)-1} = X_{d+(\ell/2)}\right) = 1.$$

Moreover, the strings consist of $k$ equally likely symbols: $P(Y_i = j) = P(X_i = j) = 1/k$,
for all $i = 1, 2, \ldots, 2d$ and $j = 1, 2, \ldots, k$.
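To make this setting concrete, here is a minimal sketch (not the authors' simulation code) that draws such a pair of strings and computes the length of an LCS with the textbook dynamic program. The function names, the zero-based placement of the block, and the parameter values are assumptions chosen only for illustration.

```python
import random

def lcs_length(x, y):
    """Length of a longest common subsequence, by the classical O(len(x)*len(y)) dynamic program."""
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def sample_strings(d, block_len, k, rng):
    """Return (X, Y) of length 2d: Y is iid uniform on {1..k}; X is iid except
    for a constant block of block_len symbols placed around the middle."""
    n = 2 * d
    x = [rng.randint(1, k) for _ in range(n)]
    y = [rng.randint(1, k) for _ in range(n)]
    start = d - block_len // 2            # zero-based start of the block (an assumption)
    symbol = rng.randint(1, k)            # the block repeats one uniformly chosen letter
    x[start:start + block_len] = [symbol] * block_len
    return x, y

rng = random.Random(0)
X, Y = sample_strings(d=100, block_len=20, k=3, rng=rng)
print(lcs_length(X, Y))
```

The dynamic program keeps only two rows, so memory stays linear in the string length even for the sequence lengths used in the simulations.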
3.1 Heuristic argument for three or more letters

In this subsection, we assume that

$$\frac{\gamma_k^*}{2} > \frac{1}{k} \tag{3.1}$$

and explain why, under (3.1), the artificially added long block is mainly aligned with
gaps (in any optimal alignment). The proof is then given in the next subsection. (See
also I) and II) stated earlier in the paper.) We proceed by contradiction. Assume, on the contrary, that
there is an optimal alignment $a$ with $m$ symbols from the long block not aligned with
gaps and hence aligned with symbols. Equiprobability, i.e., $P(Y_i = j) = 1/k$ for every
$i = 1, 2, \ldots, 2d$ and every $j = 1, 2, \ldots, k$, implies that in order to see $m$ times the same
letter in a contiguous substring of $Y$, we typically need a piece of length about $km$. Hence,
if $m$ symbols from the long block get aligned with symbols, then typically a piece of $Y$ of
length about $km$ is needed. Let us now modify the alignment $a$. For this, take the piece
of $Y$ which was used for the $m$ symbols of the long block and align it otherwise. Let $\bar{a}$ be
the new alignment obtained in this way. We lose $m$ aligned letters from the long block
but, on the other hand, realigning the $km$ symbols of $Y$ can add about $km(\gamma_k^*/2)$ aligned
symbols somewhere else. So the change is

$$km\frac{\gamma_k^*}{2} - m = m\left(\frac{k\gamma_k^*}{2} - 1\right).$$

But from (3.1), $\gamma_k^* k > 2$, and so the change due to realigning the $km$ symbols from $Y$ outside
the long block typically leads to an increase in the number of aligned symbols. This
implies that the alignment $a$ aligns fewer letter pairs than the alignment $\bar{a}$, and therefore $a$
cannot be an optimal alignment!
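The sign of the heuristic change per block symbol, $k\gamma_k^*/2 - 1$, can be checked numerically. The sketch below is only an illustration: the values for $k = 4, 5, 7$ are the approximate constants 0.325, 0.305 and 0.27 quoted in Section 3.4, the value for $k = 2$ is derived from the relation $3\gamma_2^*/2 - 1 \approx 0.215$ used there, and the function name is hypothetical.

```python
# Approximate values of gamma_k^*/2 (assumptions taken from the rounded constants in
# Section 3.4; the k = 2 entry is derived from 3*gamma_2^*/2 - 1 = 0.215).
HALF_GAMMA = {2: 1.215 / 3, 4: 0.325, 5: 0.305, 7: 0.27}

def change_per_symbol(k, half_gamma):
    """Heuristic change in score per block symbol that is realigned: k*gamma_k^*/2 - 1."""
    return k * half_gamma - 1.0

for k, half_gamma in sorted(HALF_GAMMA.items()):
    # A positive value means that realigning the piece of Y is expected to pay off.
    print(k, round(change_per_symbol(k, half_gamma), 3))
```

With these rough constants the change is negative for $k = 2$, where (3.1) fails, and positive for the larger alphabets listed.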

3.2 Proofs for three or more letters 

Recall that in Subsection 2.1 we defined:

• $A^d$, the event that the long block is mainly aligned with gaps.

• $O^d$, the event that when replacing the long block with iid symbols we get an increase
of about $\gamma_k^*/2$ times the length of the long block.

This subsection is dedicated to proving that if $\gamma_k^*/2 > 1/k$, then both events $A^d$ and
$O^d$ occur with high probability. In other words, this subsection is dedicated to proving
Theorem 2.1. For this we proceed as follows: we first define four events $B^d$, $C^d$, $D^d$ and
$F^d$ and in Lemma 3.1 prove that

$$B^d \cap C^d \cap D^d \subset A^d.$$

In Lemmas 3.5, 3.4 and 3.6 it is shown that the events $B^d$, $C^d$ and $D^d$ occur with high
probability, which together with Lemma 3.1 then implies that $A^d$ also occurs with high
probability. In Lemma 3.2 we prove that

$$B^d \cap C^d \cap D^d \cap F^d \subset O^d,$$

and in Lemma 3.7 it is shown that $F^d$ has high probability of occurrence. Let us next
define the events $B^d$, $C^d$, $D^d$ and $F^d$, but recall first that:

• $k$ is the number of letters in our strings.

• $2d$ is the length of the strings $X$ and $Y$.

• $\beta$ and $\beta_1$ are two constants independent of $d$ such that

$$\tfrac{1}{2} < \beta_1 < \beta.$$

• $d^{\beta}$ is the length of the long block which gets artificially inserted into the string $X$.

• $d^{\beta_1}$ is the maximum number of symbols from the long block which can get aligned
with symbols instead of gaps (we will prove that this bound holds with high probability).

• ^1) 7fc 7fc) Iky Ik constants independent of d and such that 

l<l^<lk<ll<lt<lt< Ik. (3.2) 

Furthermore, one can think that ki is about equal to k and that 7^, 7^, 7^ and 7^ 
are all very close to 7^. 

• $\gamma_k(n,p)$ is defined by

$$\gamma_k(n,p) := \frac{E|LCS(V_1V_2V_3\ldots V_{n-np}; W_1W_2W_3\ldots W_{n+np})|}{n},$$

where $V_1, V_2, \ldots$ and $W_1, W_2, \ldots$ are two iid sequences with $k$ equiprobable letters
and $p \in [-1, 1]$.

• $\gamma_k(p)$ is the limit

$$\gamma_k(p) := \lim_{n\to\infty} \gamma_k(n,p).$$

The function $\gamma_k$ is symmetric around $p = 0$ and concave, and so it has a maximum
at $p = 0$. To prove all the results in this article, we assume that $\gamma_k$ is differentiable
at all its points of maxima. This condition seems to be very difficult to prove, but
from simulations of the curve $\gamma_k(\cdot)$ we have no doubt that it holds. Let $[-p_M, p_M]$
with $p_M \geq 0$ be the largest interval on which $\gamma_k$ is everywhere equal to its maximum
value $\gamma_k(0)$. (In other words, $[-p_M, p_M] = \gamma_k^{-1}(\{\gamma_k(0)\})$.) Our condition simply says
that the derivative of $\gamma_k(\cdot)$ exists at $p_M$ and is equal to zero.

• The reals gi < < ^2 are such that 

7fc(gi) = 7fc(g2) = 7fc- 
(From the concavity of the function $\gamma_k$, the reals $q_1, q_2 \in [-1, 1]$ are uniquely defined,
with also 

Vge[gi,g2], Ikiq) > ll-) (3.3) 

Finally, assume also that 



7fc(P2) - 7fc(Pi 



P2 -Pl 



< , (3.4) 



for all pi,p2 e [gi, ^2]. Thanks to the condition that the derivative at pm exists and 
is therefore zero, we can always determine 7^ and 7^ so that the inequalities (13.41) 
and (13.21) hold simultaneously. For this simply keep 7^ fixed and let 7^ converge from 
below towards 7^. When 7^ gets close enough to 7^(0) = 7^, then the conditions 
are fulfilled. 



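• $i_1, i_2$ are defined by

$$i_1 := \frac{1 + 2q_1}{1 - 2q_1}\left(d - \frac{\ell}{2}\right) \tag{3.5}$$

and

$$i_2 := \frac{1 + 2q_2}{1 - 2q_2}\left(d - \frac{\ell}{2}\right). \tag{3.6}$$

Note that $i_1, i_2$ both depend on $d$. Moreover, according to our notation, when
$i \in [i_1, i_2]$, the expected LCS satisfies

$$E|LCS(X_1X_2\ldots X_{d-(\ell/2)}; Y_1Y_2\ldots Y_i)| = d^*\,\gamma_k(d^*, q),$$

where $d^* = (d - (\ell/2) + i)/2$ and $q \in [q_1, q_2]$, so that the lower bound (3.3) applies to $\gamma_k(q)$.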
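The quantity $\gamma_k(n,p)$ has no known closed form, but it can be estimated by simulation. The following is a minimal Monte Carlo sketch under stated assumptions: the function names are hypothetical, the lengths $n - np$ and $n + np$ are rounded to integers, and the sample sizes are far smaller than what a serious estimate would need.

```python
import random

def lcs_length(x, y):
    """Length of a longest common subsequence (two-row dynamic program)."""
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            curr.append(prev[j - 1] + 1 if a == b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def estimate_gamma(k, n, p, trials, rng):
    """Monte Carlo estimate of gamma_k(n, p) = E|LCS(V_1..V_{n-np}; W_1..W_{n+np})| / n."""
    len_v = round(n - n * p)   # rounding to an integer length is an assumption
    len_w = round(n + n * p)
    total = 0
    for _ in range(trials):
        v = [rng.randrange(k) for _ in range(len_v)]
        w = [rng.randrange(k) for _ in range(len_w)]
        total += lcs_length(v, w)
    return total / (trials * n)

rng = random.Random(1)
for p in (-0.4, -0.2, 0.0, 0.2, 0.4):
    print(p, estimate_gamma(k=2, n=100, p=p, trials=20, rng=rng))
```

Since an LCS cannot be longer than the shorter string, every estimate is at most $1 - |p|$, which gives a quick sanity check on the output.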

Let $B^d$ be the event that to find $d^{\beta_1}$ times the same symbol in $Y$, at least a
piece of length $k_1 d^{\beta_1}$ is needed. More precisely, let $B^d(i, h)$ be the event that every
letter appears in the string $Y_iY_{i+1}\ldots Y_{i+h}$ strictly less than $h/k_1$ times. For this, let
$r \in \{1, 2, \ldots, k\}$ and let $W_j(r)$ be the Bernoulli random variable which is equal to one
if $Y_j = r$ and zero otherwise. With this notation, $B^d(i, h)$ is the event that for every
$r = 1, 2, \ldots, k$, the inequality

$$\sum_{j=i}^{i+h} W_j(r) < \frac{h}{k_1}$$

holds, and let

$$B^d := \bigcap_{i \in [1, 2d],\ h = d^{\beta_1}k_1} B^d(i, h).$$
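The event $B^d$ is a statement about every window of $Y$, so it can be checked directly on a sampled string. The sketch below is illustrative only: the function names are hypothetical, indices are zero-based, and windows are restricted to those that fit entirely inside the string (an assumption, since the text lets $i$ range over $[1, 2d]$).

```python
import random
from collections import Counter

def max_letter_count(y, i, h):
    """Largest number of occurrences of a single letter among y[i], ..., y[i+h]."""
    window = y[i:i + h + 1]
    return max(Counter(window).values())

def event_B_holds(y, h, k1):
    """True if, in every full window y[i..i+h], each letter appears strictly less than h/k1 times."""
    return all(max_letter_count(y, i, h) < h / k1 for i in range(len(y) - h))

rng = random.Random(3)
k, k1 = 3, 2.5                     # k1 slightly below k, chosen only for illustration
Y = [rng.randint(1, k) for _ in range(2000)]
print(event_B_holds(Y, h=300, k1=k1))
```

Recounting each window from scratch costs time proportional to the string length times $h$; a sliding update of the counts would be faster but is omitted to keep the sketch short.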
Let $C^d_I$ be the event that when increasing the length of $Y_1Y_2\ldots Y_i$ by $d^{\beta_1}k_1$
then the optimal alignment of Xi . . . Xd-(e/2) and Y1Y2 . . .Yi, increases by at least 
7^/2 times the length of the piece added for all i G [«i,«2]- More precisely, let 
C^(i, h) be the event that by adding h letters to the right of Y1Y2 ■ ■ - Y^, the LCS with 
X1X2 . . . Xd-{e/2) increases by at least /i7^/2. Hence the event C^{i, h) holds when 

\LCS{X^X2 . . . X,_(,/2); Y{Y2 . . . - \LCS{X^X2 . . . X,_(,/2); Y{Y2 • • • > 
and 

fl C^iiM. (3.7) 

i&[h,i2\,h=dhki 

where ii and 12 have been defined in (I3.5p . In a similar fashion one defines Cf{i^ h) to be 
the event that by adding h letters to the left of Y1Y2 ■ ■ - Yi, the LCS with X1X2 . . . Xd-{i/2) 
increases by at least h^ljl, and then, as above, one defines C^. Finally, let = CfuCf. 

Let D'^ be the event that any optimal alignment aligns (£/2) 1 into the in- 
terval [ii,i2]- To define the event D*^ precisely, we first need to set a convention: when 
an alignment a aligns Xi with Yj, then we say that Xj gets aligned onto the point j by a. 
If a aligns Xj onto a gap, then let ij be the largest m < i such that Xm gets aligned with 
a symbol and not with a gap. Assume that Xj^ gets aligned with Yj. Then we say that 
Xi gets aligned onto the point j by a. We define Df to be the event that any optimal 
alignment a of X and Y, we have that: 

• Hi designates the place where Xd-(i/2)-i gets aligned to by a, then i G [«i,«2]- 
We define Djj to be the event that any optimal alignment a of X and Y, we have that: 

• Hi designates the place where X^_(-^/2)-i+A,id'3i g^ts aligned to by a, then i G [ii,i2]- 
Finally 

D'^ := D'} n D'jj. 

Let be the event that increasing the length of X*X2 . . (f/2) 1 by d'^, the 
alignment of Xj^X^ . . . X^ (^^2) 1 ^1^2 • • • increases by at least 7^/2 times 
the length of the added piece for all i G [ii,i2]- More precisely, let Ff be the event 
that 

\LCS{X*X2 . . . X^_^_(^/2); ^1^2 ■ ■ - Yi)] - \LCS{X^X2 . . . Xl_^(^i2yYiY2 ■ ■ .Yi)\ > 

and let F"^ be the event that Ff holds for every i G [ii,'j2]- 
Next comes the first combinatorial lemma of this subsection: 

Lemma 3.1

$$B^d \cap C^d \cap D^d \subset A^d.$$

Proof. The proof is by contradiction. Assume that $A^d$ does not hold. Then there is
an optimal alignment $a$ for which there are at least $d^{\beta_1}$ letters from the long block not
aligned with gaps. If the event B"^ holds however, those d^'^ letters from the long block are 
aligned with a portion of Y1Y2 . . . of length at least d^^ki. In other words there exists 
i G [l,2(i] such that the optimal alignment a aligns d^'^ letters from the long block with 
YiVij^i . . . Yi+h where h > d^^ki. Hence the optimal alignment a aligns ViFj+i . . . 1^+^ only 
with those c?'^^ letters from the long block. By the event D''^ we can assume that i G [ii, 'i2]- 
Let us modify the alignment a, and no longer align the d^^ letters from the long block 
with YiYi^i . . . Yi^h- So, we lose d^'^ aligned letters. But now we realign YiYi^i . . . Yi+h 
with a part of X outside the long block. In other words we now align 1^11^2 . . . Yi^^ entirely 
with X1X2 . . ■Xii-{i/2)- By the event we are guaranteed to gain at least 'jlh/2, and 
since h > d^^ki, this is at least d^^ki'-)1/2. Summing what the loses and the gains our 
modification of a leads to an increase of at least 

- = rf.. f M _ iV (3,8) 



By the inequality (13. 2p . > 2, and therefore (13. 8p is strictly positive. This implies 
however that we can modify the alignment a to improve it, and so a is not an optimal 
alignment. This is a contradiction, and B'^, and D'^ together imply A'^. ■ 
Let us now state and prove the second combinatorial lemma of this subsection: 

Lemma 3.2 

$$B^d \cap C^d \cap D^d \cap F^d \subset O^d.$$

Proof. When the events B'^, and D'^ all hold, we have, by the previous lemma, that 
the long block contains at most d^^ letters which are not matched with gaps in any optimal 
alignment. Let a be an optimal alignment of X and Y, then a aligns at least d^ — d^^ 
symbols from the long block with gaps. Assume that X'^~^^^'^^~^ gets aligned to Yi by a. 
We are going to modify a in order to obtain a new alignment a* to align X* and Y. Let 
a* be obtained from a in the following manner: instead of aligning X^ . . . X^_^^^^^ with 
Y1Y2 . . .Yi add to the X-part 

^*d-{l/2)+l^*d-{l/2)+2 ■ ■ ■ ^*d+(l/2)- 

So, now we align Y1Y2 . . .Yi with XIX2 . . . -^^+(£/2) optimal way. (By "optimal way" 
we mean, that we align 1^112 . . .Yi with Xj*X| . . . -^^_,_(^/2) choosing any alignment which 
corresponds to a LCS of Y1Y2 . . .Yi and X^X| . . . X^_|_^^y2)-) To align the remaining letters 
of the strings we use the alignment a. Hence, if m G 2(i] and n G [(i+(£/2) + l, 2d] and 
if a aligns X„ with Ym, then a* is aligning X* with Ym. By the event D'^ we have that i G 
[ii, i2\. Hence, by the event F'^, adding -^rf_(^/2)+i"^d-(^/2)+2 • • • -^d+{i/2) ^'^ X-part leads 
to an increase of at least (7^/2) d''. But the modification also means that some losses can 
occur: there could be from the long block up to d^^ letters which under a are not aligned 
with gaps. Therefore replacing the long block by the piece -^rf_(£/2)+i-^d_(£/2)+2 • • • ■^d+{i/2) 
we might lose up to d^^ aligned symbol-pairs. Summarizing, we have a gain of at least 



17 



(7fc/2)(i^ and a loss not exceeding d^i. Thus, the total change causes an increase of at 
least 

2 

This proves that the event O'^ holds and finishes this proof. ■ 
As already mentioned, in order to prove that both $A^d$ and $O^d$ hold with high probability,
it is enough to prove that $B^d$, $C^d$, $D^d$ and $F^d$ all have high probability, which is what we
do next. To prove that $C^d$ holds with high probability we need the following:

Lemma 3.3 Let t := . For all d large enough and all i G [ii, d] (see (13. 5p ), 

n\LCS{XlX; . . . n>^2 . . . Y,+u)\ - \LCS{xix; . . . x:_^,/,y,Y,Y^ . . . > -f. 

(3.9) 

where h = kid^^ . 

Proof. Let EA denote the expected difference 

E[\LCS{X^X2 . . . X^_(^/2)^ ^1^2 • • • Yi+h)\ - \LCS{X*X2 . . . X^„(£/2); "^1^2 • • • 
By definition. 



where 



EA = rf27fc(<^2,P2) - di-fkidi^pi] 



d2 (d-^ + i + h 

d, ■■=-^[d--^+^ 

i + h-d+ U/2) 

P2 — 



Pi 



2{d - {i/2) +i + h) 

i-d+ {i/2) 
''2{d~{i/2)+i) ■ 



We know from Alexander [T] that there exists a constant C > (not depending on d or 
q) such that 

|7fc(f^,g)-7fc(g)l<^7^, (3.10) 
for all g G [—1, 1] and all c? G N. Using f l3.10p and since di, ^2 < 3(i, lead to 

EA > d2-ik{p2) - di-fkiPi) - 2CV3 \n3dVd. (3.11) 

Now 

d2lk{p2) - c^i7fc(pi) = ^^-^ + d2^ Ap (3.12) 

/ Ap 
where

A7 = 7fc(P2) - 7fc(Pi) 

and 

Ap = p2 -pi. 

Now p —7- '~fk{p) is concave and symmetric around 0, it is thus non- decreasing for p < 
and non- increasing for p > 0. Thus if p2 < 0, then > so that 

d2lk{p2) - dMPi) > (3.13) 

If P2 > 0, on the other hand, we find 

^ i + h-d+ {£/2) h 
" - 2{d - (£/2) +t) 2{d - (£/2) +t)- d' ^ ' 

where the rightmost inequahty above follows from 2(c? — (£/2) + i) > d, which is valid 
for d large enough. Since i G [^1,^2] and i + kid^^ G [«i,i2], then pi,p2 G [q'i,q'2] and the 
inequality (13. 4p leads to 



Ap - 12 ■ 

Using the above together with (I3.14p and the fact that ^2 < 3(i, we find that 

+d2^Ap>h(:^- hl^] . (3.15) 



2 Ap 

Note however that since i G [^1,^2], we have pi G [5'i,5'2] and hence 7fc(pi) > 7^ (by 
definition of 12), which yields 

^I^T^.M^^ >/,(^|_M^^. (3.16) 

Now, flXT^ together with f lXT^ . flXTTD and flXT^ imply 

EA > hd^' + - 2cy3hr3rfyrf. (3.17) 

Note that since /3i is a constant independent of d and strictly larger than 1/2, and since 
7^ — 7^ > 0, the expression 2c^/3\n3d^/d becomes "negligible" in comparison to d^^{'yl — 
7^). So for large enough d, (I3.17P implies that 

EA > kid^''^, 
_ 1 2 ' 

which is what we intended to prove. ■ 
The next lemma shows that the event $C^d$ occurs with high probability:
Lemma 3.4 For $d$ large enough,



P(C"^) > 1 - 2rfexp 
Proof. Let $\Delta$ denote the difference

$$|LCS(X_1X_2\ldots X_{d-(\ell/2)}; Y_1Y_2\ldots Y_{i+h})| - |LCS(X_1X_2\ldots X_{d-(\ell/2)}; Y_1Y_2\ldots Y_i)|$$

and let $h = k_1 d^{\beta_1}$. Let $(C^d_I(i, h))^c$ be the complement of the event $C^d_I(i, h)$ (see (3.7)).
Note that when $(C^d_I(i, h))^c$ holds, then

A < (3.18) 

Moreover, A is function of the iid entries X^, , . . . , -^rf_(^/2) ^"^^ ^i, ^2, • • • , ^i+h and 
changing one of its entries changes A by at most 2 and that d — {1/2) + i + h < 3d. 
Assuming that i G [ii, d], Lemma [3.31 applies and EA > 7^/1/2. Together with fl3.18p . this 
leads to 

A-EA<M_^^ = M^. (3.19) 
- 2 2 2 ^ ' 

In other words, the event {Cf{i, h)Y implies that the inequality (13.19^ holds true. Hence, 
1 - P(C,^(i, h)) < P f A - EA < rf* (7g - ll) \ ^g ^g) 



d* J 2 

where d* = d — {i/2) + i + h. Note that by assumption 7^ — 7^ < and therefore by 
Hoeffding's inequality, the right hand side of f l3.20p is upper bounded by 

exp I -rfM i V ^^^^^^ 1 . (3.21) 



d 

We have d* < 3d and for d large enough d* > d, therefore f l3.20p becomes 
1 - FiC^{^, h)) < exp (^-d ' A;?M_i£ 

and hence 

1 - F{Ctit, h)) < exp (^-rf2/3i-iA;2Mz_2i)! 
Since in the interval [ii,d\ there are less than d elements 

1 - net) < de^V (^-d^P^-^kl^^^^^^^ 
A symmetric argument leads to the same bound for $C^d_{II}$ so that

1 - P(C"^) < 2dexp (^-^2/31-1^2 (2iz_Z|)! 

Note that 2/3i — 1 > 0, so that the above is really a negative exponential bound in a 
fractional power of d. ■ 
The next lemma shows that the event $B^d$ occurs with high probability.

Lemma 3.5 For $d$ large enough,

$$P(B^d) \geq 1 - 2dk\exp\left(-2h\left(\frac{1}{k_1} - \frac{1}{k}\right)^2\right),$$

where $h = k_1 d^{\beta_1}$.

Proof. Let $B^d(i, r)$ be the event that $r \in \{1, \ldots, k\}$ appears strictly less than $h/k_1$ times
in the string $Y_iY_{i+1}\ldots Y_{i+h}$, and write $B^d(i)$ for $B^d(i, h)$. Hence,

$$\bigcap_{r \in \{1, \ldots, k\}} B^d(i, r) \subset B^d(i),$$

and since all the letters have equal probabilities:

$$P\big((B^d)^c(i)\big) \leq kP\big((B^d)^c(i, 1)\big). \tag{3.22}$$

Now if $B^d(i, 1)$ does not hold true, then the letter 1 appears at least $h/k_1$ times in
$Y_iY_{i+1}\ldots Y_{i+h}$. Let $W_j$ be the Bernoulli random variable which is equal to one if $Y_j = 1$
and zero otherwise. With this notation, the event $(B^d)^c(i, 1)$ is the event that

$$\sum_{j=i}^{i+h} W_j \geq \frac{h}{k_1} \tag{3.23}$$

holds true, where again by equiprobability, $P(W_j = 1) = EW_j = 1/k$. Hence

$$P\big((B^d)^c(i, 1)\big) = P\left(\sum_{j=i}^{i+h} W_j - E\left[\sum_{j=i}^{i+h} W_j\right] \geq \frac{h}{k_1} - \frac{h}{k}\right), \tag{3.24}$$

and since $(1/k_1) - (1/k) > 0$, another use of Hoeffding's inequality leads to

$$1 - P\big(B^d(i, 1)\big) \leq \exp\left(-2h\left(\frac{1}{k_1} - \frac{1}{k}\right)^2\right). \tag{3.25}$$

Since

$$(B^d)^c = \bigcup_{i \in [1, 2d],\ r \in \{1, \ldots, k\}} (B^d)^c(i, r),$$

then

$$P\big((B^d)^c\big) \leq \sum_{i \in [1, 2d],\ r \in \{1, \ldots, k\}} P\big((B^d)^c(i, r)\big) = 2dk\,P\big((B^d)^c(1, 1)\big).$$

Applying (3.25) we find

$$P\big((B^d)^c\big) \leq 2dk\exp\left(-2h\left(\frac{1}{k_1} - \frac{1}{k}\right)^2\right). \qquad ■$$
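To get a feeling for when the bound of Lemma 3.5 becomes informative, one can simply evaluate it. The sketch below is only an illustration; the function name and the parameter values ($k = 4$, $k_1 = 3$, $\beta_1 = 0.75$) are assumptions, and in particular $k_1$ is taken further from $k$ than the proofs suggest in order to make the decay visible at moderate $d$.

```python
import math

def hoeffding_bound_B(d, k, k1, beta1):
    """Upper bound 2*d*k*exp(-2*h*(1/k1 - 1/k)**2) on P((B^d)^c), with h = k1 * d**beta1."""
    h = k1 * d ** beta1
    return 2 * d * k * math.exp(-2 * h * (1 / k1 - 1 / k) ** 2)

for d in (10**2, 10**3, 10**4, 10**6):
    # Values above 1 carry no information; the bound only bites once d is large.
    print(d, hoeffding_bound_B(d, k=4, k1=3.0, beta1=0.75))
```

The polynomial factor $2dk$ dominates for small $d$, and the exponential term takes over as $d$ grows, in line with the "for $d$ large enough" qualifier in the lemma.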



Next we are going to prove that the event $D^d$ occurs with high probability:
Lemma 3.6 For d large enough, 



^{0") > 1 - 4dexp -d 
Proof. Recall that $D^d = D^d_I \cap D^d_{II}$. Next, let $D^d_I(i)$ denote the event that there exists an
optimal alignment of $X$ with $Y$ aligning $X_{d-(\ell/2)-1}$ to the point $i$. Now for $D^d_I$ to hold it
is enough that none of the events $D^d_I(i)$ holds for $i \in [0, i_1] \cup [i_2, 2d]$. Hence

$$\bigcap_{i \in [0, i_1] \cup [i_2, 2d]} (D^d_I(i))^c \subset D^d_I,$$

and hence

$$(D^d_I)^c \subset \bigcup_{i \in [0, i_1] \cup [i_2, 2d]} D^d_I(i),$$

so that

$$P\big((D^d_I)^c\big) \leq \sum_{i \in [0, i_1] \cup [i_2, 2d]} P\big(D^d_I(i)\big). \tag{3.26}$$

Let $L(i)$ denote the maximal score when leaving out the big block and giving as constraint
that $X_{d-(\ell/2)-1}$ gets aligned to the point $i$, i.e.,

$$L(i) := |LCS(X_1X_2\ldots X_{d-(\ell/2)-1}; Y_1Y_2\ldots Y_i)| + |LCS(X_{d+(\ell/2)+1}X_{d+(\ell/2)+2}\ldots X_{2d}; Y_{i+1}\ldots Y_{2d})|.$$

When $D^d_I(i)$ holds, then

$$L(i) + 2d^{\beta} \geq LC^*_{2d},$$

where $LC^*_{2d}$ is the length of the longest common subsequence of $X_1^*X_2^*\ldots X_{2d}^*$ and $Y_1Y_2\ldots Y_{2d}$.
Indeed, between $X$ and $X^*$ there are less than $d^{\beta}$ bits changed, and so the difference in
length between the LCS of $X$ and $Y$ and the LCS of $X^*$ and $Y$ is at most $d^{\beta}$. Also, if
$D^d_I(i)$ holds then the difference between $L(i)$ and the length of the LCS of $X$ and $Y$ is at
most $d^{\beta}$. This then implies that the difference between $L(i)$ and $LC^*_{2d}$ is less than $2d^{\beta}$
when the event $D^d_I(i)$ holds. Therefore,

$$P\big(D^d_I(i)\big) \leq P\big(L(i) + 2d^{\beta} \geq LC^*_{2d}\big). \tag{3.27}$$
Now

$$P\big(L(i) + 2d^{\beta} \geq LC^*_{2d}\big) = P\big(L(i) - LC^*_{2d} - EL(i) + ELC^*_{2d} \geq ELC^*_{2d} - EL(i) - 2d^{\beta}\big). \tag{3.28}$$
But as (i cx), ELC2d/'2d 7^, and via f|L2|) 



ELC;^ > 2d-fk - CVdlnd. (3.29) 
On the other hand, by definition 

EL{i) = di7fc(pi, di) + (i27fc(P2, ^^2), (3.30) 

where 

d, ■■=l{d-^- + ^ 
and 

_ i -d+{e/2) _2i-2d + i 
~ d-{£/2) + i ~ 2i + 2d-i' 

Moreover, 

7fc(P2,c^2) < 7fc, (3.31) 

and 

i 

di + d2 = 2d--< 2d. (3.32) 
Now if z ^ "^2], then, by definition of ii, 12 we have 

lk{vi)<ll- (3.33) 

and by a sub-additivity argument, 

7fc(pi) = hm -fkiPud) 

a— >oo 

is larger or equal to 7fc(pi, (i), for every d eN, hence: 

IkiPudi) < 7fc(pi). 

Applying this last inequality with fl3.3ip . 03.32 p . fl3.33p to fl3.30p and assuming that d is 
large enough so that di > d/i, we find 

EL{i) < 2^7* - "^^^^ . (3.34) 

Combining fl3.34p and fl3.29p . we obtain that 

E(LC*rf - L{t)) - 2d^ > ^^^^ 7 - CVh^dVd - 2d^ (3.35) 
Now by definition of 7^ we have 7^ — 7^ > 0. Also $\beta < 1$. So for $d$ large enough, the right
hand side of (3.35) is at least d(7^ — 7^)/8 so that

E(LC2*d - L{i)) - 2d^ > "^^^^ ~ . (3.36) 

8 

Using f l3.36p with fl3.27p and fl3.28p . we get that for d large enough 

P(^/(i)) < P {Hi) - Lc;^ - EL{i) + el;^ > ^Mzilll ) . (3.37) 



Applying Hoeffding's inequality for i outside [^i,'j2], 

< exp (^-dM_^^ . (3.38) 

Applying f l338|) to fl3:26|) we get 

ie[0,ii]U[i2,2d] ^ ^ 

The same bound can be found for F{Djj), which finished proving this lemma. ■ 
Finally we need one more lemma: 

Lemma 3.7 There exists a constant k,> 0, independent of d, such that 
for all d > 1 . 

Proof. The proof is very similar to the proof that $C^d$ occurs with high probability
(see Lemma 3.4), so we leave it to the reader. ■



Proof of Theorem 2.1. This is the main theorem for the case of three or more letters. (That
is, $\gamma_k^*/2 > 1/k$.) It states that the events $A^d$ and $O^d$ both hold with high probability. Hence,
it shows that for $d$ large enough, typically with three or more letters, the long block
gets mainly aligned with gaps. Moreover, it asserts that replacing the long block by iid symbols
typically leads to an increase in the LCS which is linear in the length of the long block.
Let us first handle $A^d$. By Lemma 3.1 we have

$$B^d \cap C^d \cap D^d \subset A^d,$$

and therefore,

$$P\big((A^d)^c\big) \leq P\big((B^d)^c\big) + P\big((C^d)^c\big) + P\big((D^d)^c\big). \tag{3.40}$$

Lemma 3.4 tells us that $P((C^d)^c)$ is exponentially small in $d^{2\beta_1 - 1}$. Lemma 3.5 tells
us that $P((B^d)^c)$ is exponentially small in $d^{\beta_1}$. Lemma 3.6 tells us that $P((D^d)^c)$ is
exponentially small in $d$. Therefore, for $\beta_1 \in\, ]1/2, 1[$, $P((A^d)^c)$ is also exponentially small
in $d^{2\beta_1 - 1}$. Hence there exists a constant $C > 0$, independent of $d$ (but depending on $k$),
such that

$$P\big((A^d)^c\big) \leq e^{-Cd^{2\beta_1 - 1}}.$$

Let us next turn our attention to the event $O^d$. From Lemma 3.2,

$$P\big((O^d)^c\big) \leq P\big((B^d)^c\big) + P\big((C^d)^c\big) + P\big((D^d)^c\big) + P\big((F^d)^c\big). \tag{3.41}$$

We have already seen that $P((B^d)^c) + P((C^d)^c) + P((D^d)^c)$ is exponentially small in
$d^{2\beta_1 - 1}$. By Lemma 3.7, $P((F^d)^c)$ is exponentially small in $d^{2\beta - 1}$. Now we assumed
that $\beta_1 < \beta$ and so $2\beta_1 - 1 < 2\beta - 1$. The right side of (3.41) is thus exponentially small
in $d^{2\beta_1 - 1}$. Hence there exists a constant $C > 0$ not depending on $d$ such that

$$P\big((O^d)^c\big) \leq e^{-Cd^{2\beta_1 - 1}}. \qquad ■$$
3.3 The binary case 

Let $x = 10010111100000101101101$ and $y = 01111001011011011101001$ be two strings,
each of length 23, with $x$ containing one long block 00000 of length $\ell = 5$. We consider three
alignments of $x$ and $y$. First, we take an alignment which aligns the long block with no
gaps. The alignment aligning the long block with zeros is given by:

[Alignment display of $y$ and $x$: the long block 00000 of $x$ is aligned, without gaps, with the zeros of the piece 0010110110 of $y$.]

Here the long block 00000 gets aligned with the piece 0010110110 from $y$. Note that
this piece from $y$ has length 10, which is twice the length of the long block. This is to be
expected: since the probability of 0 is 1/2, we need a string of length about $2\ell$ to get $\ell$
zeros. Here the alignment can be viewed as consisting of three parts: the part to the left
of the long block in $x$, the aligned long block, and the part to the right of the long block.
The part to the left of the long block aligns 5 letter pairs. The long block here gives 5
letter pairs and the piece to the right of the long block gives 7. Hence the total number
of aligned letter pairs in this alignment is 5 + 5 + 7 = 17.
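The three scores in this example can be checked mechanically. The sketch below, which is only an illustration and not part of the original argument, splits $x$ and $y$ at the positions described above (the long block $x_{10}\ldots x_{14}$ against the piece $y_6\ldots y_{15}$) and computes the LCS length of each part with the standard dynamic program; the function name is hypothetical.

```python
def lcs_length(x, y):
    """Length of a longest common subsequence (two-row dynamic program)."""
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            curr.append(prev[j - 1] + 1 if a == b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

x = "10010111100000101101101"
y = "01111001011011011101001"

left = lcs_length(x[:9], y[:5])        # x_1..x_9 against y_1..y_5
block = lcs_length(x[9:14], y[5:15])   # long block x_10..x_14 against y_6..y_15
right = lcs_length(x[14:], y[15:])     # x_15..x_23 against y_16..y_23

print(left, block, right, left + block + right)
print(lcs_length(x, y))                # unconstrained LCS length of x and y
```

Gluing the three partial alignments together yields a common subsequence of $x$ and $y$, so the unconstrained LCS length printed on the last line can never be smaller than the sum of the three parts.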

Let us next try an alignment which would align the long block with a piece of string
of similar size, i.e., of size 7. We can take the alignment

[Alignment display of $y$ and $x$: the long block of $x$ is aligned with the piece 0110110 of $y$.] (3.42)

Here the alignment gets us 4 + 3 + 6 = 13 aligned letter pairs, which is less than the
previous alignment. This is exactly what we were predicting: when we align the long
block entirely with letters and no gaps, then the score tends to be higher. (Of course
we expect this phenomenon to be even more pronounced when $\ell$ gets larger.) In our
second alignment above, the long block gets aligned with the piece of string
0110110. Two of the zeros of the long block are aligned with gaps. We are now going
to show how introducing a small modification to the second alignment can increase the
score. For this we take the two zeros from the long block which are aligned with gaps and
align them with zeros from the string $y$. We take a piece of $y$ to the left of $y_9$ containing
two zeros. To get two zeros to the left of $y_9$, we need to take $y_6y_7y_8 = 001$. We now align
the two "unused" zeros from the long block with $y_6$ and $y_7$; the two "unused" bits we are
referring to are $x_{10}$ and $x_{11}$. By aligning the two "unused" bits we gain two points, but at
the same time lose something, since before $y_6y_7y_8$ was aligned to $x_5x_6\ldots x_9 = 01111$. So,
now the previous alignment of $y_6y_7y_8$ with $x_5x_6\ldots x_9$ gets destroyed and we lose two points
for that. However, $x_5x_6\ldots x_9$ is now "free," so we are going to "include" this string into
the alignment of $y_1y_2\ldots y_5$ with $x_1\ldots x_4$, meaning that we align $x_7x_8$ with $y_4y_5$. This
addition will give us two points, and the total gain is 2 − 2 + 2 = 2. Let us represent
what is happening in "toy" form. There are three phases. We start with the following
situation (we only show below the part of the alignment we are going to modify):



[Partial alignment display of $y$ and $x$ before the modification.]

The first phase consists in aligning the two unused bits $x_{10}x_{11}$ from the long block with
$y_6y_7$. This leads to:

[Partial alignment display of $y$ and $x$ after the first phase.]

So we gained two aligned letters because $x_{10}x_{11}$ gets aligned with $y_6y_7$, but at the same
time we lose two aligned letter pairs, because before $x_5x_6x_7x_8x_9$ had two aligned letters
and has none now. Next we "bring the string $x_5x_6x_7x_8x_9 = 01111$ into" the alignment

[Partial alignment display of $y_1\ldots y_5 = 01111$ and $x_1\ldots x_4 = 1001$.] (3.43)

In the alignment (3.43) we see that there are two ones on the right end of $y$ which are free
(i.e. the bits $y_4y_5$). We are going to align two of the ones from $x_5x_6x_7x_8x_9$ with them and gain two
additional aligned letter pairs, and the end result then looks like:

[Partial alignment display of $y$ and $x$ after the modification.]

The total change is $2 - 2 + 2 = 2 > 0$. Hence, the alignment (3.42) cannot be optimal
since we can improve it by at least two units.

In the previous example, we had two zeros from the long block aligned with gaps.
Assume that instead of 2, we would have $j$, where $j$ is not too small. Then to align these
$j$ zeros with zeros from the string $y$ we would need a string of length about $2j$ in $y$ (in
order to find $j$ zeros, since the zeros have probability about 1/2). (In the example above,
the piece of string with which we aligned the free zeros from the long block had length
3.) Before we change the alignment, these $2j$ bits from $y$ were most likely aligned with
about $2j$ bits from $x$. (In our example, these bits of $x$ are $x_5x_6x_7x_8x_9$.) When we align
these additional bits which become free we expect to gain about $\gamma_2^*(2j)/2 = \gamma_2^* j \approx 0.8j$.
The argument being that with two sequences of length $j$ we have a total of $2j$ bits and
a score about equal to $\gamma_2^* j$. Hence, the ratio score/bits is $\gamma_2^*/2$. (Using this average is
a purely heuristic argument, since we have no proof that adding bits on one side of an
alignment only produces an average increase of $\gamma_2^*/2$ per bit.) Summing up:

a) From the new alignment of the $j$ free bits from the long block we gain $j$ points.

b) Losing the previous alignment of the piece of $y$ which newly gets aligned with
the free bits of the long block makes us lose about $2\gamma_2^* j$ points. This is so because
that piece has length about $2j$. (It is the piece which corresponds in the example
to $y_6y_7y_8$.)

c) Finally, we gain by realigning the piece of $x$ which was previously aligned with the
piece of $y$ getting aligned now with the free bits of the long block. (In our example
these are the bits $x_5x_6x_7x_8x_9$.) The length of this piece of string should be about
$2j$, so the gain should be about $2j\gamma_2^*/2 = \gamma_2^* j$.

Hence the grand total of the change is

$$j - 2\gamma_2^* j + \gamma_2^* j = j - \gamma_2^* j \approx 0.2j > 0.$$

From what we saw in the previous example with 2-letter strings, the tendency is to
align the long block with almost no gaps. How much do we gain by replacing, in the 2-letter
case, the long block by an iid piece? The answer is as follows: for the long block we get
$\ell$ points, but use a piece of length $2\ell$ in the $y$-string for this. That piece becoming free,
we gain $2\ell$ bits plus the $\ell$ bits from the long block. Hence before we use $3\ell$ bits to get $\ell$
points. If we believe in our "average point/bit number hypothesis" of $\gamma_2^*/2$, we can gain
about $3\ell\gamma_2^*/2$ aligned letter pairs. That is after replacing the long block by a regular iid
piece, and realigning all the $3\ell$ bits which before were used with the long block. Hence,
we should get about $3\ell\gamma_2^*/2 - \ell \approx 0.215\ell$ additional letter pairs. We will see that this
number is extremely close to the number we get in our simulations.

With a long block of length 20 we should gain on average about 4 points.
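The two back-of-the-envelope formulas of this subsection are easy to tabulate. The sketch below is illustrative only; it assumes $\gamma_2^* \approx 0.81$, the rough value implied by $3\gamma_2^*/2 - 1 \approx 0.215$, and the function names are hypothetical.

```python
GAMMA_2 = 0.81  # assumed approximate value of gamma_2^*, implied by 3*gamma_2^*/2 - 1 = 0.215

def realignment_gain(j, gamma=GAMMA_2):
    """Heuristic net change a) - b) + c): j - 2*gamma*j + gamma*j, for j block zeros taken off gaps."""
    return j - 2 * gamma * j + gamma * j

def replacement_gain(block_len, gamma=GAMMA_2):
    """Heuristic gain from replacing the long block by an iid piece: 3*l*gamma/2 - l."""
    return 3 * block_len * gamma / 2 - block_len

for size in (2, 10, 20, 100):
    print(size, round(realignment_gain(size), 2), round(replacement_gain(size), 2))
```

For a block of length 20 the second formula gives about 4.3, in line with the "about 4 points" stated above.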

3.4 Simulations 

Let us analyze the results of our simulations. In the next table we give as entries the
increase in the LCS when we replace the long block with iid entries. The sequence
length is 1000; hence, according to our notation, $2d = 1000$. In what follows $\ell$ is the block
length and $k$ is, as usual, the number of letters. For each entry we ran 100 simulations and
then took the average. Empty cells are not reported.

| | ℓ = 1 | ℓ = 2 | ℓ = 5 | ℓ = 10 | ℓ = 20 | ℓ = 30 | ℓ = 50 | ℓ = 100 | ℓ = 200 | ℓ = 300 | ℓ = 400 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| k = 2 | -0.01 | 0.06 | 0.52 | 0.9 | 2.88 | 4.7 | 9.48 | 21.8 | 44.3 | 73.5 | 88.5 |
| k = 3 | | | 0.45 | 1.36 | 4.55 | 7.62 | 14.5 | 32.8 | 68.9 | | |
| k = 4 | 0.03 | 0.06 | 0.59 | 1.85 | 5.3 | 8.86 | 14.32 | 31.4 | | | |
| k = 5 | | 0.2 | 0.58 | 1.78 | 4.88 | 7.81 | 14.18 | 29.9 | | | |
| k = 6 | | 0.1 | 0.53 | 1.86 | 4.42 | 7.7 | 12.8 | 27.9 | | | |
| k = 7 | | 0.13 | 0.7 | 2.05 | 4.7 | 7.3 | 13.1 | 27.28 | | | |
| k = 9 | | | 0.7 | 1.85 | 4.33 | 7.26 | 11.6 | | | | |


In the next table we display which values we would expect from our heuristic argument
for the typical increase in LCS for long blocks and the values we obtained through
simulations. First we start with two letters. In that case, our predicted change in LCS due
to the replacement of the long block is $(\gamma_2^*/2)3\ell - \ell \approx 0.215\ell$, for 2 letters and a block of
length $\ell$.

| | ℓ = 1 | ℓ = 2 | ℓ = 5 | ℓ = 10 | ℓ = 20 | ℓ = 30 | ℓ = 50 | ℓ = 100 | ℓ = 200 | ℓ = 300 | ℓ = 400 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.215ℓ | 0.215 | 0.43 | 1.07 | 2.15 | 4.3 | 6.45 | 10.7 | 21.5 | 43 | 64.5 | 86 |
| EΔ | -0.01 | 0.06 | 0.52 | 0.9 | 2.88 | 4.7 | 9.48 | 21.8 | 44.3 | 73.5 | 88.5 |

Let us next compare the predicted values with our simulated values for the 4-letter case.
For the 4-letter case we expect an increase of $(\gamma_4^*/2)\ell \approx 0.325\ell$.

| | ℓ = 1 | ℓ = 2 | ℓ = 5 | ℓ = 10 | ℓ = 20 | ℓ = 30 | ℓ = 50 | ℓ = 100 | ℓ = 200 | ℓ = 300 | ℓ = 400 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.325ℓ | 0.325 | 0.65 | 1.62 | 3.25 | 6.5 | 9.7 | 16.2 | 32.5 | | | |
| EΔ | 0.03 | 0.06 | 0.59 | 1.85 | 5.3 | 8.86 | 14.32 | 31.4 | | | |

Let us next compare the predicted values with our simulated values for the 5-letter case.
For the 5-letter case we expect an increase of $(\gamma_5^*/2)\ell \approx 0.305\ell$.

| | ℓ = 1 | ℓ = 2 | ℓ = 5 | ℓ = 10 | ℓ = 20 | ℓ = 30 | ℓ = 50 | ℓ = 100 | ℓ = 200 | ℓ = 300 | ℓ = 400 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.305ℓ | | 0.61 | 1.52 | 3.05 | 6.1 | 9.15 | 15.2 | 30.5 | | | |
| EΔ | | 0.2 | 0.58 | 1.78 | 4.88 | 7.81 | 14.18 | 29.9 | | | |

Finally, let us compare the predicted values with our simulated values for the 7-letter
case. For the 7-letter case we expect an increase of $(\gamma_7^*/2)\ell \approx 0.27\ell$.

| | ℓ = 1 | ℓ = 2 | ℓ = 5 | ℓ = 10 | ℓ = 20 | ℓ = 30 | ℓ = 50 | ℓ = 100 | ℓ = 200 | ℓ = 300 | ℓ = 400 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.27ℓ | | 0.54 | 1.35 | 2.70 | 5.4 | 8.10 | 13.5 | 27.00 | | | |
| EΔ | | 0.13 | 0.7 | 2.05 | 4.7 | 7.3 | 13.1 | 27.28 | | | |

We see that with more letters, the approximation is already good for shorter blocks.
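The quality of the heuristic prediction can be quantified by the relative deviation of the simulated averages from the predicted line. The sketch below is an illustration only: it copies the simulated values for $k = 2$ and $k = 7$ reported in this section, uses the rounded slopes 0.215 and 0.27, and the helper names are hypothetical.

```python
# Block lengths and simulated average increases, copied from the tables of this section.
LENGTHS_K2 = [1, 2, 5, 10, 20, 30, 50, 100, 200, 300, 400]
SIMULATED_K2 = [-0.01, 0.06, 0.52, 0.9, 2.88, 4.7, 9.48, 21.8, 44.3, 73.5, 88.5]
LENGTHS_K7 = [2, 5, 10, 20, 30, 50, 100]
SIMULATED_K7 = [0.13, 0.7, 2.05, 4.7, 7.3, 13.1, 27.28]

def relative_deviation(slope, lengths, simulated):
    """(simulated - predicted) / predicted for the heuristic prediction slope * block length."""
    return [(s - slope * l) / (slope * l) for l, s in zip(lengths, simulated)]

for name, slope, lengths, simulated in (
    ("k = 2", 0.215, LENGTHS_K2, SIMULATED_K2),
    ("k = 7", 0.27, LENGTHS_K7, SIMULATED_K7),
):
    deviations = relative_deviation(slope, lengths, simulated)
    print(name, [(l, round(dev, 2)) for l, dev in zip(lengths, deviations)])
```

At block length 10, for instance, the simulated value for $k = 7$ is considerably closer to its prediction than the one for $k = 2$, which is the pattern described in the closing remark above.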
3.5 Proofs for binary strings

The purpose of this section is to prove Theorem 2.2, and so here $\gamma_k k < 2$, which corresponds to the 2-letter case $k = 2$. Also, everywhere in this subsection, the sequences $X$ and $Y$ are binary sequences of length $2d$. In the middle of the string $X$ there is one artificially placed long block with length about $\ell = d^{\beta}$. Everywhere else $X$ is iid. Similarly, $Y$ is iid and independent of $X$. Theorem 2.2 states that typically, for $d$ large enough, the long block gets mainly aligned with symbols and not with gaps. The event $G^d$ stating that the long block gets mainly aligned with symbols was defined in Subsection 2.1. Theorem 2.2 also asserts that replacing the long block by iid symbols typically leads to an increase in the LCS that is linear in the length of the long block. The event $H^d$, which describes the increase in LCS due to the replacement of the long block, was defined in Subsection 2.1. So, the aim of this section is to prove that both events hold with high probability. This is done in a very similar way to what we did for the case of three or more letters.

Let $k_{II}$ and $\gamma_2^{II}$ be two constants, independent of $d$, such that $k_{II} > 2$ and $\gamma_2^{II} > \gamma_2$, but also such that $k_{II}\gamma_2^{II} < 2$ (this last choice is certainly possible since $\gamma_2 < 1$). Actually, for the argument which follows, any values $k_{II} > 2$ and $\gamma_2^{II} > \gamma_2$ will do, provided the constants are close enough to their respective bounds and do not depend on $d$.
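As a numerical illustration of why such a choice exists, take $\gamma_2 \approx 0.81$, which is the value implied by the predicted slope $3\gamma_2/2 - 1 \approx 0.215$ used in the simulations. The particular values $2.1$ and $0.85$ below are hypothetical, chosen here only as an example for the constants of the preceding paragraph (written $k_{II}$ and $\gamma_2^{II}$):

```latex
\[
k_{II} = 2.1, \quad \gamma_2^{II} = 0.85
\;\Longrightarrow\;
k_{II}\,\gamma_2^{II} = 1.785 < 2,
\qquad
1 - \frac{k_{II}\,\gamma_2^{II}}{2} = 0.1075 > 0 .
\]
```

The strictly positive quantity on the right is the factor that makes the gain in the contradiction argument of this subsection positive.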

Recall that $G^d$ denotes the event that the long block gets mainly aligned with symbols and not with gaps. More precisely, let $G^d$ be the event that the long block has at most $d^{\beta_1}$ of its symbols aligned with gaps in any alignment corresponding to an LCS (see Subsection 2.1).

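Let $B^d_{II}$ be the event that in any piece of $Y$ of length $k_{II}d^{\beta_1}$ there are no less than $d^{\beta_1}$ ones and no less than $d^{\beta_1}$ zeros. More precisely, let $B^d_{II}(i)$ be the event that

$$\sum_{j=i}^{i+h} Y_j \geq d^{\beta_1}$$

and that

$$\sum_{j=i}^{i+h} |Y_j - 1| \geq d^{\beta_1},$$

where $h = k_{II}d^{\beta_1}$. Finally let

$$B^d_{II} := \bigcap_{i=1}^{2d-h} B^d_{II}(i).$$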
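The window condition just defined is easy to test on a concrete binary sequence. The sketch below is an illustration only: the function name `window_counts_ok` is hypothetical, and it assumes that the window length `h` and the count `threshold` have already been rounded to integers (the text does not say how the real quantities are rounded).

```python
def window_counts_ok(y, h, threshold):
    """Return True if every window y[i], ..., y[i+h] (h + 1 entries,
    matching a sum from j = i to i + h) contains at least `threshold`
    ones and at least `threshold` zeros.

    `y` is a list of 0/1 values; indices are 0-based in this sketch.
    """
    n = len(y)
    # prefix[m] = number of ones among y[0], ..., y[m-1]
    prefix = [0]
    for bit in y:
        prefix.append(prefix[-1] + bit)
    for i in range(n - h):
        ones = prefix[i + h + 1] - prefix[i]
        zeros = (h + 1) - ones
        if ones < threshold or zeros < threshold:
            return False
    return True
```

Prefix sums make the check linear in the length of `y`, instead of recounting each window from scratch.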

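Let $C^d_{II}$ be the event that an increase of the length of $Y_1 \ldots Y_i$ by $k_{II}d^{\beta_1}$ leads to an increase of the LCS of $Y_1 \ldots Y_i$ and $X_1 \ldots X_{d-(\ell/2)}$ of no more than $k_{II}d^{\beta_1}\gamma_2^{II}/2$ for all $i + k_{II}d^{\beta_1} \in [i_1, i_2]$. More precisely, let $C^d_{II}(i)$ be the event that

$$|LCS(X_1 \ldots X_{d-(\ell/2)}, Y_1 \ldots Y_{i+h})| - |LCS(X_1 \ldots X_{d-(\ell/2)}, Y_1 \ldots Y_i)| \leq \frac{h\gamma_2^{II}}{2},$$

where $h := k_{II}d^{\beta_1}$. Finally, let $C^d_{II}$ be defined by

$$C^d_{II} := \bigcap_{i+h \in [i_1, i_2]} C^d_{II}(i).$$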
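The LCS increments that appear in this event can all be read off from a single run of the standard dynamic program, because the last row of the table holds the LCS length of the fixed prefix of $X$ with every prefix of $Y$. The following is a sketch under assumptions: the names `lcs_with_all_prefixes` and `increments_bounded` are hypothetical, and `h`, `i_lo`, `i_hi` are taken to be integers.

```python
def lcs_with_all_prefixes(x, y):
    """Return a list r with r[j] = |LCS(x, y[:j])| for j = 0..len(y)."""
    row = [0] * (len(y) + 1)
    for xi in x:
        new = [0]
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                new.append(row[j - 1] + 1)
            else:
                new.append(max(row[j], new[j - 1]))
        row = new
    return row


def increments_bounded(x_prefix, y, h, bound, i_lo, i_hi):
    """Check that extending the prefix of y from length i to length
    i + h raises the LCS with x_prefix by at most `bound`, for every
    i >= 0 with i_lo <= i + h <= min(i_hi, len(y))."""
    r = lcs_with_all_prefixes(x_prefix, y)
    first = max(0, i_lo - h)
    last = min(len(y), i_hi) - h
    for i in range(first, last + 1):
        if r[i + h] - r[i] > bound:
            return False
    return True
```

Here `x_prefix` plays the role of the part of $X$ to the left of the long block, and `bound` the role of the right-hand side of the inequality above.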
Lemma 3.8 We have $B^d_{II} \cap C^d_{II} \cap D^d \subset G^d$.

Proof. The proof is by contradiction. Assume that $a$ is an optimal alignment of $X$ and $Y$ aligning more than $d^{\beta_1}$ symbols of the long block with gaps. Again let $h := k_{II}d^{\beta_1}$. Assume that $a$ aligns $X_{d-(\ell/2)}$ with $j$. Then because of $D^d$ we have $j \in [i_1, i_2]$. Let $i := j - h$, so that $i + h \in [i_1, i_2]$. Thus $C^d_{II}$ "applies" to $i$, meaning that when we "take out" the piece $Y_i Y_{i+1} \cdots Y_j$ from the alignment $a$ we lose no more than $h\gamma_2^{II}/2$. Now, because of the event $B^d_{II}$, the string $Y_i Y_{i+1} \ldots Y_{i+h}$ contains the symbol the long block is made of at least $d^{\beta_1}$ times. Hence we can take $d^{\beta_1}$ symbols of the long block which are aligned by $a$ with gaps, and align them instead with symbols contained in the piece of string $Y_i Y_{i+1} \ldots Y_{i+h}$. Let $\bar{a}$ denote the new alignment obtained from modifying $a$ in this way. We gain $d^{\beta_1}$ aligned symbols from the long block (which were aligned with gaps and now are aligned with symbols), but from the previously aligned symbols from $Y_i \ldots Y_{i+h}$ we could lose as many as $h\gamma_2^{II}/2$ aligned symbol pairs. Hence the gain is at least

$$d^{\beta_1} - \frac{h\gamma_2^{II}}{2} = d^{\beta_1}\left(1 - \frac{k_{II}\gamma_2^{II}}{2}\right). \qquad (3.44)$$

Our constants $k_{II}$ and $\gamma_2^{II}$ were picked so that $1 - (k_{II}\gamma_2^{II})/2 > 0$. Thus (3.44) is strictly positive, and hence $\bar{a}$ aligns more letter pairs than $a$. This implies that $a$ is not optimal, which is a contradiction. Therefore, it is not possible for more than $d^{\beta_1}$ symbols of the long block to get aligned with gaps when $B^d_{II}$, $D^d$ and $C^d_{II}$ all hold. This proves that $B^d_{II}$, $D^d$ and $C^d_{II}$ together imply that any optimal alignment $a$ of $X$ and $Y$ can align at most $d^{\beta_1}$ symbols from the long block with gaps. Hence, $B^d_{II}$, $D^d$ and $C^d_{II}$ jointly imply $G^d$. ■

High probability of $G^d$. In Lemma 3.8 we proved that the events $B^d_{II}$, $D^d$ and $C^d_{II}$ jointly imply $G^d$, and so

$$P((G^d)^c) \leq P((B^d_{II})^c) + P((D^d)^c) + P((C^d_{II})^c). \qquad (3.45)$$

Now in Subsection 3.2 we have shown that $P((D^d)^c)$ is negatively exponentially small in $d$. In the same section it was also shown that $P((B^d)^c)$ is negatively exponentially small in $d^{\beta_1}$, while $P((C^d)^c)$ is negatively exponentially small in $d^{2\beta_1 - 1}$. In the current section we do not use $B^d$ and $C^d$, but $B^d_{II}$ and $C^d_{II}$ instead. However, the proof of the high probability of occurrence of $B^d_{II}$ and $C^d_{II}$ is almost step by step the same as for $B^d$ and $C^d$, so we leave it to the reader. The orders of magnitude of the probability bounds are the same. Hence $P((B^d_{II})^c)$ is exponentially small in $d^{\beta_1}$ and $P((C^d_{II})^c)$ is exponentially small in $d^{2\beta_1 - 1}$. As mentioned before, $\beta_1 > 2\beta_1 - 1$ since $\beta_1 \in \,]1/2, 1[$.

These orders of magnitude together with inequality (3.45) imply that

$$P((G^d)^c) \leq e^{-c_G d^{2\beta_1 - 1}}$$

for all $d$, where $c_G > 0$ is a constant not depending on $d$. This finishes establishing the high probability of the small proportion of gaps aligned with the long block.
The rest of this subsection is devoted to the increase of the LCS when replacing the long block with iid symbols. Thus, from here on this section is about proving the high probability of occurrence of the event $H^d$, which states that the increase in the LCS is at least $c_H > 0$ times the length of the long block. (Recall that $c_H$ is the constant used in the definition of the event $H^d$. It can be any positive real smaller than $3\gamma_2/2 - 1$, but must not depend on $d$.) Recall that $X$ contains a long block in the interval $[d - \ell/2, d + \ell/2]$, which means

$$P\left(X_i = X_d, \ \forall i \in [d - \ell/2, d + \ell/2]\right) = 1.$$

The string $X^*$ is obtained by replacing the long block in $X$ by iid symbols. This means that $X^*_i = X_i$ for all $i \notin [d - (\ell/2), d + (\ell/2)]$. Let $k_{II} < 2$ and $\gamma_2^{II} < \gamma_2$ be two constants not depending on $d$. We take $k_{II}$ extremely close to 2 and $\gamma_2^{II}$ extremely close to $\gamma_2$ and request moreover that

$$\left(\frac{(1 + k_{II})\gamma_2^{II}}{2} - 1\right) \geq c_H. \qquad (3.46)$$

(Note that $3\gamma_2/2 - 1$ is approximately equal to $0.2 > 0$.) Once $c_H > 0$ is determined, we define $k_{II}$ and $\gamma_2^{II}$ in the following way: since $c_H$ is strictly smaller than $3\gamma_2/2 - 1$, we can simply take $\gamma_2^{II}$ close enough to $\gamma_2$ and $k_{II}$ close enough to 2, and we will get that (3.46) holds. The exact values of $k_{II}$ and $\gamma_2^{II}$ are not important, as long as (3.46) is satisfied as well as $k_{II} < 2$ and $\gamma_2^{II} < \gamma_2$.
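To see that condition (3.46) leaves room for a concrete choice, here is a worked example. It assumes $\gamma_2 \approx 0.81$ (the value implied by the predicted slope $3\gamma_2/2 - 1 \approx 0.215$). The numbers $c_H = 0.1$, $k_{II} = 1.95$ and $\gamma_2^{II} = 0.80$ are hypothetical and serve only as an illustration:

```latex
\[
\frac{(1 + k_{II})\,\gamma_2^{II}}{2} - 1
= \frac{2.95 \times 0.80}{2} - 1
= 1.18 - 1
= 0.18 \;>\; 0.1 = c_H ,
\]
\[
\text{with } k_{II} = 1.95 < 2, \qquad \gamma_2^{II} = 0.80 < 0.81 \approx \gamma_2 .
\]
```

Any smaller positive $c_H$ works with the same two constants, while a $c_H$ closer to $0.215$ forces $k_{II}$ and $\gamma_2^{II}$ closer to their bounds.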

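Next, let $B^d_{II}$ be the event that in any piece of $Y$ of length $k_{II}d^{\beta}$ there are strictly less than $d^{\beta} - d^{\beta_1}$ zeros and strictly less than $d^{\beta} - d^{\beta_1}$ ones. More precisely, let $B^d_{II}(i)$ be the event that

$$\sum_{j=i}^{i+h} Y_j < d^{\beta} - d^{\beta_1}$$

and that

$$\sum_{j=i}^{i+h} |Y_j - 1| < d^{\beta} - d^{\beta_1},$$

where $h = k_{II}d^{\beta}$. Finally let

$$B^d_{II} := \bigcap_{i=1}^{2d-h} B^d_{II}(i).$$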



Let $C^d_{II}$ denote the event that an increase of the length of $Y_1 \ldots Y_i$ by $k_{II}d^{\beta}$ and an increase of $X_1 \ldots X_{d-(\ell/2)}$ by $d^{\beta}$ leads to an increase of the LCS of at least $d^{\beta}(1 + k_{II})\gamma_2^{II}/2$ for all $i \in [i_1, i_2]$. More precisely, let $C^d_{II}(i)$ be the event that

$$|LCS(X^*_1 \ldots X^*_{d-(\ell/2)+h_X}, Y_1 \ldots Y_{i+h_Y})| - |LCS(X_1 \ldots X_{d-(\ell/2)}, Y_1 \ldots Y_i)| \geq d^{\beta}\,\frac{(1 + k_{II})\gamma_2^{II}}{2},$$

where $h_X := d^{\beta}$ and $h_Y := k_{II}d^{\beta}$, and let

$$C^d_{II} := \bigcap_{i \in [i_1, i_2]} C^d_{II}(i).$$

Recall that $H^d$ is the event that replacing the long block by iid symbols increases the LCS by at least $c_H d^{\beta}$. More precisely, $H^d$ is the event that

$$|LCS(X^*, Y)| - |LCS(X, Y)| \geq c_H d^{\beta}.$$

Lemma 3.9 We have $B^d_{II} \cap C^d_{II} \cap D^d \subset H^d$.

Proof. Assume that $B^d_{II}$, $C^d_{II}$ and $D^d$ all hold true. Let $a$ be an optimal alignment of $X$ and $Y$. We know that when $B^d_{II}$, $C^d_{II}$ and $D^d$ all hold, then $G^d$ also holds. Hence, $a$ aligns at least $d^{\beta} - d^{\beta_1}$ symbols from the long block with symbols. Let $[i, j]$ denote the interval to which the long block gets aligned by $a$. By this we mean that $X_{d-(\ell/2)}$ gets aligned by $a$ into $i$ while $X_{d+(\ell/2)}$ gets aligned into $j$. Then, since there are at least $d^{\beta} - d^{\beta_1}$ symbols from the long block aligned with symbols, by the event $B^d_{II}$ we have that $j - i$ must be larger than or equal to $k_{II}d^{\beta}$ (in order to contain sufficiently many copies of the same symbol). Now we are going to modify the alignment $a$ to obtain an alignment $\bar{a}$ of $X^*$ and $Y$. (Note that $a$ is an optimal alignment of $X$ and $Y$.) The new alignment $\bar{a}$ is identical to $a$ in the way it aligns $X_{d+(\ell/2)+1} X_{d+(\ell/2)+2} \cdots$ with $Y_{j+1} Y_{j+2} \cdots Y_{2d}$, but instead of aligning the long block to $Y_i Y_{i+1} \cdots Y_j$, it now aligns $X^*_1 X^*_2 \ldots X^*_{d+(\ell/2)}$ to $Y_1 Y_2 \cdots Y_j$. By the event $D^d$, we know that $i \in [i_1, i_2]$ and hence we can apply $C^d_{II}$ to $i$. This then yields that the gain by aligning $X^*_1 X^*_2 \cdots X^*_{d+(\ell/2)}$ with $Y_1 Y_2 \cdots Y_j$ (instead of just aligning $X_1 X_2 \ldots X_{d-(\ell/2)}$ with $Y_1 Y_2 \ldots Y_i$) is at least $d^{\beta}(1 + k_{II})\gamma_2^{II}/2$. There is a loss of at most $d^{\beta}$ symbols from the long block, so the improvement is of at least

$$d^{\beta}\left(\frac{(1 + k_{II})\gamma_2^{II}}{2} - 1\right) \geq c_H d^{\beta},$$

where the last inequality is (3.46), and therefore the event $H^d$ holds true. This finishes the proof.



High probability of $H^d$. Lemma 3.9 implies that

$$P((H^d)^c) \leq P((B^d_{II})^c) + P((C^d_{II})^c) + P((D^d)^c). \qquad (3.47)$$

In Subsection 3.2 we have shown that $P((D^d)^c)$ is negatively exponentially small in $d$ and that $P((B^d)^c)$ is negatively exponentially small in $d^{\beta_1}$, while $P((C^d)^c)$ is negatively exponentially small in $d^{2\beta_1 - 1}$. In the current section we do not use $B^d$ and $C^d$, but $B^d_{II}$ and $C^d_{II}$ instead. However, the proof of the high probability of occurrence of $B^d_{II}$ and $C^d_{II}$ is almost step by step the same as for $B^d$ and $C^d$, so we leave it to the reader. The orders of magnitude of the probability bounds are the same, hence $P((B^d_{II})^c)$ is exponentially small in $d^{\beta_1}$ and $P((C^d_{II})^c)$ is exponentially small in $d^{2\beta_1 - 1}$. As mentioned before, $\beta_1 > 2\beta_1 - 1 > 0$ since $\beta_1 \in \,]1/2, 1[$. These orders of magnitude together with inequality (3.47) imply that

$$P((H^d)^c) \leq e^{-C_H d^{2\beta_1 - 1}}$$

for all $d \geq 1$, where $C_H > 0$ is a constant not depending on $d$. This finishes establishing the high probability of the increase due to replacing the long block by iid symbols. ■